SELF-CULTIVATION AND THE CREATIVE ACT: ISSUES
AND CRITERIA¹


HAROLD RUGG 
Teachers College, Columbia University 


1 


No contribution of the child-centered schools is greater than the 
discovery of the principle that only an artist-teacher can discover 
and develop the artist in the child. Only one who has lived through 
the art experience can provide art experiences for the children; that is, 
one must himself have the attitude of a creative person in order to 
develop children into creative persons. This is the important generali- 
zation that emerges from the scores of creative artists in our new 
schools during the past fifteen years. 

But the cultured person is both pragmatist and artist, both a 
maker and doer and a creative appreciator of life. The complete 
educational program, therefore, must embrace the attitudes, concepts 
and techniques of the creative and appreciative individual as well as 
those of the problem-solver. Now to develop the instrumental values 
and activities of life a monumental library has already been produced 
under the leadership of Mr. Dewey. But of the criteria for the 
education of the Man-as-Artist little has been said. 

It was to the concepts of the artist that we turned in our attempt to 
sketch a balanced portrait of the integral person. In him we perceived 
the guiding concept: The integrity of one’s own self . . . one’s authentic
inner truth as the true criterion for judgment and for conduct . . .
admiration for every well-thought-out personal philosophy. It is 
the Man-as-Artist who is sensitive to the criterion of integrity, who is 
dominated by the attitude of appreciative awareness. The true
craftsman is he who stresses Feeling-Import, who gives creative
desire a place coordinate with intelligence. He sees man whole and
in turn visualizes the nation as a multitude of integral persons.
Correspondingly, it is the artist in the school who is concerned with the
production of persons, not primarily with developing professional
poets, painters, actors, thinkers, or musicians. Hence the source for a
psychology of the creative act lies in the experience and the vision of
the artist himself.

¹ Chapter XIX of the author’s forthcoming book, “Culture and Education in
America.” Harcourt, Brace and Company, New York. (In press.)

In addition to possessing a sensitiveness to the norms and criteria of 
the creative attitude there is another fundamental qualification for 
serving as the artist-teacher: That is, the sensitiveness to the potential 
artist in the child and his methods of work. The development of 
creative education in the schools depends more crucially upon the 
adoption of the “drawing-out” attitude by the teacher than it does
upon his mastery of technical knowledge and skill in manipulating the 
materials of one art. The creative teacher has this attitude. He 
regards every child as a potential artist, each in his own modicum of 
creative power. Hence, the artist is a sensitive listener to childhood. 
As Miss Levin puts it: “You have to feel the thing the child wants to
do, to think his thought, in short, become a child yourself.”

The educational conclusion from such postulates is clear. In 
order that children shall become self-expressive craftsmen with 
words or tone, with clay, wood or stone, with light and shade, the 
teacher must be a craftsman with those materials. Thus, and only 
thus, can he become sensitive to the potential person in the child. 
This was the epoch-making message of Hughes Mearns. It was likewise
the stimulation given by Satis Coleman in her “creative music.”
So the theme recurs with the materials of painting, sculpture, the 
dance, or the drama. Through each medium of self-expression 
children develop the ability to express themselves honestly, creatively, 
and to grow as persons only to the extent that the teacher’s attitude 
and procedures provide for them. 

Thus, the significance of the entrance of the artist into the school 
is as great as his discovery of the artist in the child. It is through 
creative production in the school, in the home, in every social agency 
that children receive practice in developing the integrity of one’s self.
They learn to measure themselves against critical inner standards. 
Children must ask: Is this poem I have written really “I”? Is this
house I have made, this music I have played, utensil I have constructed,
brief I have prepared, oration I have delivered, as close an
approximation of my true self as I can make it? This is the criterion
which the artist ruthlessly applies to himself, and this is what he has 
contributed to the creative phases of the new education. 

He has recognized attitudes, points of view, and techniques which 
are indispensable in the education of the cultured man. Thus he 
offers an important contribution to general education as well as to the 
development of potential artists. 


2 


The scientific attitude is analytic, and the method of work con- 
centrates upon the systematic collection of facts measured on scales of 
equal units, the discovery of recurrences, uniformities, “law.”

The artist’s attitude, however, is integrating, all-embracing, 
appreciative. The artist “measures,” but against the unique inner
norms of his own peculiar experience, not against external standards.
Correspondingly, his goal is nonuniformity, the unique individual
thing. Whereas science emphasizes the adjustment of the individual
to an external norm and seeks the confirmation of generalizations, art
on the contrary shuns repetition and denies the possibility of confirmation
or refutation.

These distinctions set the stage for our study of the difference 
between the methods of work involved in three essential kinds 
of activity: (1) Intelligent problem-solving, (2) creative produc- 
tion, (3) appreciation. Phrased another way, the question amounts 
to this: What are the likenesses and differences among these processes? 
Students of the creative act maintain that there is a difference between 
the process of problem-solving (in which assimilation plays the leading 
rôle) and that of creative self-expression and contemplative awareness.
The instrumentalists deny this. They maintain that the assimilative 
act and the creative act are merely differing aspects of the same 
general procedure of learning. Always protagonists of the unity of 
experience, they maintain that those who distinguish between “assimilation”
and “creation” are resorting to a dualism. Since the discussion
throughout the book has emphasized the unified character of
human experience, that criticism does not apply in the present instance. 
The distinctions which are now to be pointed out are of another type. 

At this point we must remind ourselves of the current tendency to 
apply the word “creative” to any kind of active vigorous learning.
Such a tendency is conspicuously evident just now among educa- 
tionists. Lectures by professors of education and others subsume 
under the caption “creative” the most obvious kinds of repetitive
learning, mastery of skills, and acquiring of information. Nothing 
but confusion can come from such a careless use of meanings and 
vocabulary. 

Next, a word concerning the data and the method of my analysis.
The data are the subjective materials of experience, and the method 
is that of introspection, or rather retrospection. We are studying the 
mental and emotional experience undergone in the creative act. It is, 
therefore, only by the introspection of the creative artist that the 
experiential data of the process can be explored. No person who 
has not experienced this process can generalize concerning it, and no 
objective measure of products can lay bare the process itself. 

Hence my description of the creative process is based upon occa- 
sional flashes obtained in the autobiographical literature of creative 
artists and creative scientific men, from similar examples of intuitive 
understanding obtained in conversations with such persons and from 
my own introspections. (In earlier years I had prolonged contact 
with the processes of the scientific method both in physical and 
intellectual technology. In recent years I have devoted considerable 
energy to the creative arts.) 

The techniques of introspection and retrospection must not be 
despised. They are the only possible means of exploring “processes”
of learning. They are indeed the techniques used by Professor Dewey
in his famous analysis of “the complete act of thought”¹ (that is, the
complete act of problem-solving, for Dewey is considering only one 
type of thought). The validity of that subjective analysis was 
established, of course, by the confirmation that came from the cumula- 
tive, introspective analyses of other students of problem-solving. The 
validity of my present analysis of the creative process must be estab- 
lished in the same way. I am only too well aware of the difficulty 
which the creative artist encounters in giving utterance to the kaleido- 
scopic succession of his complex inner states. Nevertheless we must 
attempt the task and to do so we must put ourselves in the attitude 
of the artist. Only so can we make a valid record of our experiences. 

¹ Dewey, John: “How We Think.” D.C. Heath and Company, New York,
1913, Chap. V.

Furthermore, as I said before, no objective measurement of
products or of overt behavior will portray the inner process itself.
That is clearly illustrated in the monumental work of Thorndike and
Judd in reading, and of Freeman in handwriting. Thorndike measured
the product of reading; Judd and Freeman measured the overt
signs of the processes themselves; that is, by the photography of
eye-movements in reading, by hand and finger movements in writing.
But even the latter did not record the experiences which produced the
behavior or the product. What they are can only be inferred and,
indeed, only approximated by the careful introspection of the person
himself.


3 


What, then, is our task? It is the analysis of three ways of 
knowing: Problem-solving, creating, and appreciating. Let us begin 
with an analysis of the four aspects of the learning process: First, the 
attitude of orientation; second, the initiation of the act; third, the 
ongoing process itself; fourth, the definiteness with which the achieve- 
ment of the goals can be ascertained; that is, the possibility of confirma- 
tion or refutation. 

In analyzing these we shall postulate the unified nature of experi- 
ence. We shall conceive of each human response as an integration of 
physiology, intellect, and emotion. No dualistic separation of 
faculties or elements of response is implied in my discussion. On the 
contrary, it is assumed that the organism responds as a unit, and 
every act is conceived as a fusion of sensori-motor set (attitude), of 
meaning, of generalization, of language, gesture, and other overt 
movement. Life is a succession of infinitesimal experiences, each a 
complex weld of the many kinds of process with which the human 
organism has the capacity to respond. The human act is not mosaic; 
it is fusion. 

Nevertheless, our argument is that all human acts are not alike. 
For example, the tendency to flee from a situation is unlike the tend- 
ency to embrace it and we give different names to them. The 
emotion of anger is unlike the emotion of love and their descriptions 
vary accordingly. One act is oriented by one attitude, another by a 
very different one. One makes much use of meaning and generaliza- 
tion, is predominantly intellectual. Another, however, is predomi- 
nantly motor response, calling into play but few intellectual elements. 
Still another may be hyper-sensory, involving little overt movement 
but much internal kinesthetic response—a gathering-together process 
highly charged with emotion. Thus the composition of human acts 
varies greatly. Hence, in exploring the acts of problem-solving, 
creating and appreciating we shall expect to find similarities and
differences in their constituent elements. 


4 


Consider, first, the attitudes orienting the act of problem-solving. 
In confronting a problem, the worker is oriented outward. The 
conditions of the problem are “given.” Here are some examples:
One has to find the distance between two points; to discover the 
combination of elements which make up an unknown; to design an 
engine of known-in-advance horse power, cylinders, and the like; 
to determine the most favorable distribution of practice required to 
produce a desired amount of skill; or to determine the dimensions of an 
I-beam which will withstand known-in-advance loads. 

In each of these “problems” the attitude is set in reference to
external needs. To grasp the problem, the individual must adopt the 
attitude necessary to understand the conditions set by it. As Dewey 
has said, unless the individual can perform an indicated set of opera- 
tions, he cannot respond with its meaning; that is, unless he adopts 
the attitude appropriate to the problem he cannot understand it. 
This is the essence of the active psychology of meaning built up pri- 
marily through the efforts of Peirce, James and Dewey. It is only 
by striking the attitude rigorously determined by conditions outside 
his own experience, external to his background of meaning and 
generalization, that the problem-solver successfully recognizes the 
“felt difficulty” in the “forked-road” situation.

In the creative attitude, however, the orientation is inward. 
It is subjective, not objective, as in problem-solving. The creating 
process is propelled by an inner urge to objectify moods, to portray 
overtly personal integrations of meaning, generalization, and emotion. 
The drive may be to write a poetic phrase or line or stanza, to portray 
something with pencil or brush, to put together a new combination 
of tones or bodily movements that will objectify a fusion of ideas 
and feeling. But the attitude adopted in the initial stage in the 
creative act is determined by reference to the subjective, inner experi- 
ence of the individual. 

There is also a second distinction. Whereas the “problem”
of the problem-solver is external to the individual, the “problem”
of the artist is internal. There is a difference in definiteness. Prob- 
lem-solving is focussed with sharpness upon conditions prescribed 
in the external world, such as the loads, span, strength of materials, 
etc. involved in the design of the I-beam, the location of points involved
in the determination of distance, or the prescribed horse power, 
number of cylinders, etc. in the design of the engine. The problem- 
solver must adjust with exactitude to these externally prescribed 
requirements. 

Not so with the orienting attitude in the creative act. It consists 
at first of little more than a vague restlessness, an undefined desire 
to express in an external product the internal experience of the individ- 
ual. This gives us, indeed, an important cue to the difference between 
problem-solving and creating; that is, the unchanging rigor and 
clarity of definition of the externally-set problem and the constantly 
changing indefinite character of the artist’s subjective vision. 

Now compare the appreciative attitude with those of problem-solving
and creating. It bears a resemblance to both, but it is much
more like the creative attitude than that of problem-solving. What
are its elements? Although it is stimulated
by external conditions, it is not really oriented externally. It
is oriented by the internal personal gathering-together of the self. 
In appreciating a tone-poem, a dance, a painting, statue or a building, 
the individual does not strive to comprehend the meaning of the 
artist, the dancer, or the designer. He strives only to catch and
coordinate whatever reverberations come from the observed thing.
Although stimulated from the outside he makes the response, the 
interpretation which his own mood, experience, and needs call 
forth. He “appreciates” in terms of what he is, what he feels and
understands. It is a sort of confident gathering together of the whole 
personality. It is an attitude of awareness, as well as one of critique; 
it is all-embracing rather than analytic. 

In all three of these attitudes there is of course meaning, gener- 
alization, physical adjustment and emotional content, but there are 
distinctive differences in their amount and integration. The apprecia- 
tive and creative attitudes are effective only to the degree to which 
they are highly charged with emotion. The problem-solving attitude 
is effective only to the degree that the worker maintains emotion at a 
low ebb. This does not mean that the problem-solver is not also 
gathered-together emotionally. He is concentrated intently upon 
his task. Sometimes a thrilling orientation is perhaps essential to 
success. On the other hand it is frequently the emotional intensity 
of concentration which inhibits the making of appropriate 
generalizations.

So much, then, for the differences in orientation of problem-solving,
creating and appreciating. 


5 


Consider, next, the similarities and differences in launching the
acts themselves. Dewey, in his classic analysis of the complete act 
of thought, leads us to the step immediately following the recognition 
of the problem—the flashing up of suggestions or solutions. In the 
case of problem-solving these solutions are hypotheses drawn from 
known data. They are tested against the requirements of the problem.
They are generalizations drawn from “facts” and are accepted
only when under scrutiny they fit the facts. 

In the creative act also this flash-like succession of fused emotion, 
meaning and movement resembles closely the successive appearance 
of solutions in problem-solving. But, as in the original orientation 
of the work, there are sharp distinctions between the two processes. 
The “suggestions” which flash up to the problem-solver are hypotheses
from known data. The meaning of each is precisely fixed; generalization
must fit the conditions set by the problem itself. In the creative
process, however, the suggestions for modifying the art product are
measured against the changing subjective experience of the person
himself. “Solutions” are accepted only as he perceives that they
correspond to his felt moods. Thus the point of reference in this
creative enterprise is subjective.


In appreciating, “analysis” of a kind undoubtedly does play a 
part as it does in problem-solving and in creating. In listening to a 
symphony our enjoyment is enhanced, for example, by noting the 
recurrence of themes, the manner in which the tones of particular 
groups of instruments merge with others, and the like. Our apprecia- 
tion of a poem or of prose writing undoubtedly is augmented by 
mastery of form and style which opens to our emotional comprehension 
varied channels of enjoyment. Illustrations could be multiplied for 
other media of expression. There is no doubt, therefore, that in the 
fullest development of the appreciative act, the mind attends to 
separate phases or aspects of the process as well as the ensemble. 

As in the creative act, however, there is a fundamental distinction 
as to point of reference. There is no attempt to make the symphony, 
the landscape, the statue fit into prescribed objective standards. 
We take from the situation what we can. We integrate it with our 
on-flow of experience, but we must not interrupt the process of receptive
awareness to dissect critically and judge the validity of the art
product. If we should do the latter, then the process becomes one of
problem-solving and not of appreciation.


6 


Third, consider the “methods of work,” the on-going procedure
itself in problem-solving, creating and appreciating. 

In problem-solving the individual “collects,” classifies and
compares facts. If they are multitudinous, he condenses and sum- 
marizes them by statistical devices. These facts are measured 
against the scales of approximately equal units. Since I have dis- 
cussed this process in earlier chapters, I shall merely summarize the 
argument by saying that the whole process of problem-solving is one of 
drawing generalizations which fit the external requirements of the 
problem. 

How different is the creative method of work. The “facts” are
the lines, words, or phrases that well up out of the inner recesses of the
artist’s experience. They are the tentative modelings in which the
sensitive hands of the sculptor or painter attempt to objectify his
moods and his vision. They are the imagined arrangements of light
and shade, color and materials in the stage-set and costuming of the
scenic designer. They are the apparently uncoordinated arrangements
of materials, dimensions, machine or instrument parts of the mechanical
inventor. They are the interpretive bodily movements of the
dancer responding to music.

These are the “facts” of the creative process. Are they measured
against external and precisely standardized norms? No; they are 
measured against the artist’s unique inner sense of the total situation 
to which he is responding. The succession of experiences is a proces- 
sion of changing, complex, unique situations. Hence, the utter 
impossibility that the artist shall systematically assemble meanings, 
lines, words, color, what-not in terms of externally known-in-advance 
conditions. As with the orientation and initiation of the creative 
act, the on-going process of the work itself is essentially objectification
of moods, slowly defining images, gradually shaping mental and
emotional complexes.

Obviously, meaning and generalization play a selective and 
interpretative rôle as well as emotion. Again we say that the creative
experience is a fusion of physiological processes, meaning, kinaesthetic 
and emotional reaction. Hence, intellectual elements play a part— 
perhaps a guiding, orienting part. Definitely what part
meaning-reacted-through-words plays in determining the actual step-by-step
advance of the artist’s work we cannot say at the present stage of the 
analysis of the creative act. We lack definitive retrospective studies 
by creative artists of the development of their work. 

It is clear, however, that the “analysis” that leads the artist to
“correct” words or phrases, shapes, lines, arrangements of material,
light and shade, combinations of tones, bodily movements is a fusion
of emotion and intellectual meaning. Ideas play a decidedly important
selective rôle in dealing with media that are primarily intellectual,
that is, words in poetry and essay, and symbols in mathematical
invention. But they play a very minor building rôle in dealing with
the media of the graphic and plastic arts, music and the dance. Here 
feeling-import ousts ideas as the directive agent. Thus it is clear that 
analysis plays a part in the carrying on of the creative method of work 
as well as in problem-solving. 

This shows itself conspicuously in the conscious effort with which 
the artist gathers himself together, determinedly focuses his mind 
upon the manifestations of his moods, and drives himself to define 
the vision which serves as his inner goal. Thus he adopts an attitude 
of ruthless criticism of the objective portrayal which is appearing on 
his canvas, in his statue or musical composition. To the best of his 
ability does it correspond with his inner vision? New relationships 
are consciously sought, new combinations of words, lines, tones, 
shapes. There is a rigorous search for new elements in design. There 
is a dogged determination to discover and build upon unifying themes. 
This process inevitably sharpens the vision as well as produces a 
maturing objective expression. Thus the on-going analytical process 
of the creative act clarifies both the inner subjective state and the outer 
objective product. 

The goal in the act of appreciation, as we have already pointed out, 
is fullest awareness. The governing attitude is a gathered-together 
receptiveness, an attempt to embrace all the reverberations that radiate 
from the words, the tones, the lines, the masses of color, or the movements
of the body, in the stimulating situation. The process is essentially
one of “listening,” not of making and doing. Here is an example
of “assimilation” that is closely akin to the creative act and quite 
foreign to the “assimilation” of the problem-solving act. In appreciat- 
ing we assimilate stimuli of tone, movement, or word-meaning, into 
our current moods. We “enjoy,” thrill over the new assimilation. 
We build it into our “experience.” We accept it, embrace it, are
lifted up, carried onward.

“Analysis” plays its constructive part in appreciating also, but it
is more like creating than problem-solving. In music, for example, the
individual “notes” the use of specially related tones and chords, the
recurrence of basal themes, the integral unity of the composition. To
the extent that he is a master of the technical forms of construction,
his enjoyment is enhanced by the recognition of the artist’s techniques.
Thus analysis leads the listener to create a new inner product; his
fusion of impressions is unique; it is “himself” living on the maximum
heights of appreciation.

But he does not analyze in order to reproduce in himself what the
artist felt and saw in creating the product.¹ If he does so it happens
by chance, not by design. That is, he does not analyze to “solve an
externally-set problem.” If he questions, for example: “Do I get the
artist’s intent?” he at that instant turns from appreciating to
problem-solving.


7


There is a fourth difference to be noted, namely, the difference in
the definiteness with which the goal of the worker can be visualized and
the success or failure of achieving it ascertained. The problem-solver
knows in advance the problem he is to solve. It is the discovery
of the “law” of relationship between factors that change together.
His method is to discover the exact combination of factors (materials,
volumes, loads, pressures, repetitions, lengths, masses and the like)
which will bring about the stated conditions. The very statement of
the problem fixes precisely the goal of the worker; it is unchanging
throughout the entire mental process.

Not so in the production of a creative thing. The goal is the
clarification of the artist’s vision and its objectification. He must see
and feel clearly enough to portray his vision on a canvas, or in a poem
or dance or musical production. The goal of the creative artist,
therefore, changes constantly. The nature of the creative process is
tentative and hesitant; there are constant interlineations, erasures,
the giving-up of old achievements, the adoption of new experimental
arrangements. Hence, also, there is the continual attitude of
discontent.





¹ Hence the futility of most “criticism” in the arts!

One more important distinction remains to be noted between
the outcome of problem-solving and creative production. The 
solution of a scientific problem can be definitely confirmed or refuted; 
that of the artist’s problem cannot. It is of the essential nature of 
scientific work that a generalization once achieved in the solution of a 
problem is susceptible of exact confirmation by independent workers 
at other times and places. Indeed, inferences drawn by the scientific 
worker are not regarded as scientific “law” unless they can be exactly
confirmed by other workers utilizing the same set of procedures. 
Thus, a central element of the scientific method is the inevitability 
of the discovery of recurrence. 

Exactly the opposite is true with the creative product. The goal 
of creative production must be a unique thing. It is a painting, a 
poem, a tone poem, an oration, which is an objective portrait of an 
inner personality. Hence the impossibility of confirmation, or of 
refutation of the product of such a personality by another. 

By what standards shall its confirmation be measured? By 
whom is it to be confirmed or refuted? The product is the artist’s 
personal record of Self. At any given moment no two human beings 
in the world are even approximately alike. Thus it is inconceivable,
except by the remotest operation of chance alone, that the peculiar
fusion of feeling-import, meaning and bodily understanding achieved
by an artist could ever be achieved by another. And if it were so
achieved, it would not be a “confirmation” of the original artist’s
“generalization.” It would either be sheer imitation, or a new original
product of the second artist. If it were the latter, it must be measured
against his personal vision, not against that of the first artist.


8 


Our comparison of the acts of problem-solving and of creating 
throws light upon the current confusion of thought concerning “representative
art” and “creative art” in the schools. We can see now that
these differ definitely, but that each is necessary in the education of the
cultured youth and in the progressive reconstruction of society.
Each plays its important rôle in developing tolerant understanding and
dynamic participation in modern life. To get the issue clearly before
us let us first illustrate what is meant by “representation,” by “representative”
art.

From the classrooms of the child-centered schools emerge a host 
of thrilling illustrations of the use of representative art. In a first 
grade unit on local community life is a model of a miniature city:
grocery store, post office, city hall, town park, railroad station, each
item made by a pupil in the class. Together these part-way creative 
products constitute a representation of the community. 

Likewise, in the fourth grade is a more mature representation of 
the community, a map drawn roughly to scale showing the districts 
of the city, rivers, chief industries, names of transporting and communi- 
cating facilities, schools, municipal government buildings, and residence 
districts. Here, too, is “representation,” of a more mature type,
however, than that of the first grade “play city.”

Consider another—a dramatization of a play written by children 
in the fifth grade depicting stages of community development. 
“Research” has been carried on by the young people to make the
costumes, appearance of houses, farms, factories and stores “represent”
fairly the civilization depicted. 

A sixth grade class has been studying the European background 
of American history. Witness the models of medieval castles and 
towns, a painting depicting life on an English manor, another illustrating
transportation in the sixteenth century.

These examples, culled from the experimentation of the new schools,
throw into clear relief both the importance of representative art in the
schools and its essential characteristics. At the same time they
provide us with clear illustration of the distinctions and the similarities 
between representative art and creative art. Note these succinctly: 

First, representation employs the essential attitudes and procedures 
of problem-solving. The individual confronts a problem, namely 
that of portraying with relative fidelity the life of the region, group or 
period, or the structure and form of the plant, animal or society under 
consideration. The orientation is upon a stated set of conditions, 
factors, needs; it is outward toward an externally set problem. 

Second, the student collects “facts” as in problem-solving. He
searches for historically correct models, for information concerning
costumes, modes of transportation, language, what-not. In “representing”
life he is obligated to approximate the “truth.” He attempts
to get a “true” feeling for the life of the group, the community, the
nation in the period under consideration.

In the third place, a necessary measure of one’s success in represen- 
tation is the degree to which he conveys a message to an audience, the 
degree to which he portrays to others a corresponding feeling for the 
life which is being depicted. This is the very essence of representative 
art. That it is essential to the sound carrying on of the education of
youth is clear. Scholarly, technically sound, representative art must 
occupy a growing sphere of influence in our progressive schools. 

But imperatively essential as it is, representative art must not be 
confused with creative art in which the individual’s objective portrayal 
is not controlled by the desire to reproduce either current or earlier 
forms and conditions of life. The outlines which we have already given
of the representative process and of the creative process show clearly the
likeness of the former to problem-solving and the clear distinctions of
the latter.

Two conclusions of crucial importance are possible as a result of this 
discussion: First, representative art and creative art are two different 
things; second, both are necessary in our schools. Representative 
art will supply a crucially needed means of artistic expression in 
building a clear understanding of our changing society. Creative art is 
indispensable to the complete development of the cultured man. 

TETRAD-DIFFERENCES FOR VERBAL SUBTESTS*

WILLIAM STEPHENSON
University College, London University

INTRODUCTION


A group test of “verbal” subtests was applied by the author to
1037 girls, followed on the next day (per testing group) by a group test
of “non-verbal” subtests. The latter has received attention in a
previous paper. The purpose of the present paper is to determine
what we can of the factor-characteristics of the verbal subtests,
relative to the verbal subtests themselves: Data will be gathered concerning
the satisfaction of the Theory of Two Additive Factors by
verbal material. Further consideration must needs be given to error
other than sampling in the tetrad-differences: As we suggested in the
previous article, sampling is likely to be only one of many sources of
error. For small populations sampling error is large in comparison 
with most other sources of error in tetrads that we have knowledge of. 
But while sampling error is diminished by increase of population, we 
may have no such control for other errors: Errors insignificant (relative 
to sampling) for small populations must needs become more and more 
observable as sampling error is diminished. Thus, contemporaneously 
with examination of material in terms of the Theory of Two Factors, 
we must give consideration to error other than sampling, and knowl- 
edge of such errors is one of the objects of our work. 


THE VERBAL SUBTESTS 


The verbal group test consisted of eight subtests, named “verbal”
because the test-units made use of printed words, phrases, or sentences
or paragraphs. Samples of the subtest test-units follow, each test-unit
showing correct responses. Each subtest will be known hereafter
by its number 1, 2, etc.


Subtest 1.—Synonyms (inventive). (Responses in italics.)
   1. tall   high
   2. sharp   quick

Subtest 2.—Sentence completion.
   1. Birds build their nests in Spring.

Subtest 3.—Classification.
   1. cap stocking coat (toffee girl wear scarf nut)
   2. dog cow elephant (brick pint kitten coal tin)

Subtest 4.—Interchanged words.
   1. greenhouses are grown in tomatoes
   2. The dark had long girl hair.

Subtest 5.—Opposites.
   1. come fill go ten
   2. every success once failure

Subtest 6.—Analogies.
   1. cat: kitten:: dog: (horse puppy foal mice catkin)

Subtest 7.—Always has.
   1. A man always has a . . . (cigar body wife money head)
   2. A bird always has . . . (cage seed wings nests legs)

Subtest 8.—Following directions.
   1. Write the first letter of the alphabet . . .
   2. Put an “x” under an “o” in the answer space . . .

* The author is greatly indebted to Professor C. Spearman, under whom the
work reported here was accomplished.

Each subtest occupied one side of a sheet of paper, quarto size.
The material was typewritten-cyclostyled, print being {9 inch high 
(small letters). The stapled sheets were given to the testees with 
subtest No. 1 showing, and the various matters of test routine were
gone through with subtest No. 1 as a sample. Each subtest was prefaced
by a set of six test-units displayed and worked through on the
classroom blackboard, the test-units being printed and spaced as in the
particular subtest.

















TABLE I

                            Number of    Time allowance,   Mean of crude   Sigma of crude
Subtest                     test-units   minutes           scores          scores

1. Synonyms                     20           2½                8.85            3.99
2. Sentence completion          25           4                 9.98            3.50
3. Classification               24           2                 7.83            3.22
4. Interchanged words           20           3                 5.35            2.33
5. Opposites                    26           3                 8.32            3.18
6. Analogies                    25           2½                7.55            4.09
7. Always has                   26           3                 9.54            3.24
8. Following directions         14           3½                5.00            1.72

The subtests in order of application, time allowances, number of
test-units, and crude scores and sigmas, are given in Table I. The 
whole group test took forty minutes for complete application. 


INTERCORRELATIONS AND TETRAD-DIFFERENCES 


Crude Scores.—Intercorrelations for subtests 1 to 8, with age, 
are given in Table II. The difference formula for r was used, for crude 
scores. The correlations are a result of checkings, both in the normal 
course, and at the instigation of the tetrad-differences themselves. 
We examine the correlations in terms of the Spearman Theory of Two 


TABLE II.—PRODUCT-MOMENT CORRELATIONS FOR 1037 GIRLS; CRUDE SCORES;
VERBAL SUBTESTS

          1      2      3      4      5      6      7      8

Age     1065   0947   0795  -0018   0822  -0018  -0686   0321
1       ....   6408   5706   4289   5151   4408   5680   6110
2              ....   6121   5691   5980   5770   5897   5799
3                     ....   4701   5871   5524   5724   5236
4                            ....   4556   5049   5010   514
5                                   ....   5174   5654   5240
6                                          ....   5609   5014
7                                                 ....   5859























Additive Factors. The influence of age is neglected for the present:
Compared with other disturbances to be considered, age effect is but
slight. Table II provides the following tetrad data:


Mean of 210 tetrad-differences.......................... 0.0323
Conventional observed pe (Mean × 0.8453)................ 0.0273
Theoretical PE approximately............................ 0.0104


We thus find error in the tetrads, over and above that attributable 
to sampling. 
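The counts of tetrad-differences quoted in this paper (210 for eight subtests, 105 for seven) follow from simple enumeration: every set of four subtests can be split into two pairs in three ways, and every two such pairings give one tetrad-difference. The following is a minimal sketch of that enumeration; the function names are illustrative and do not come from the paper.

```python
from itertools import combinations


def tetrad_index_sets(labels):
    """Yield each tetrad as four index pairs (ab, cd, ef, gh),
    standing for r_ab * r_cd - r_ef * r_gh."""
    for a, b, c, d in combinations(labels, 4):
        # The three ways of splitting four subtests into two pairs.
        p1 = ((a, b), (c, d))
        p2 = ((a, c), (b, d))
        p3 = ((a, d), (b, c))
        # Each pair of pairings gives one tetrad-difference.
        yield p1 + p2
        yield p1 + p3
        yield p2 + p3


def count_tetrads(labels):
    """Number of tetrad-differences available for the given subtests."""
    return sum(1 for _ in tetrad_index_sets(labels))


print(count_tetrads(range(1, 9)))  # subtests 1 to 8
print(count_tetrads(range(2, 9)))  # subtests 2 to 8
```

The count is three times the number of ways of choosing four subtests, which is why dropping one subtest of eight halves the number of tetrads.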

The tetrads involving subtest No. 1 have observed pe of amount 
0.0353, and supply some of the largest of the 210 tetrad-differences. 
This subtest was introduced as a “shock-absorber,” and was in view
of the girls during the preliminary explanations of testing routine—
it is probably potent as a disturber of tetrads because of a bias for
“speed.” It seems that we cannot take liberties of the kind granted
to subtest No. 1, without having disturbances resulting in tetrads
which involve the subtest. In the circumstances it is perhaps legitimate
to omit subtest No. 1 from all further deliberations and this, in
any case, is the procedure adopted: Its retention would not greatly
affect the data obtained from consideration of the remaining seven
subtests. The latter, subtests 2 to 8, provide 105 tetrad-differences:


Mean of 105 tetrad-differences.......................... 0.0229
Observed pe (Mean × 0.8453)............................. 0.0194
Observed sigma ($0.6745\sqrt{\sum t^2/n}$)*............. 0.0187
Theoretical PE approximately............................ 0.0104


* Where $t$ stands for tetrad-difference, $n$ their number.


We note error of amount about 0.015 in excess of that which can be
attributed to sampling.
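The two “observed” values in the summary above are two estimates of a probable error from the same set of tetrad-differences: one from the mean absolute difference, the other from the root-mean-square. A small sketch of both follows. The function names are illustrative, and the remark that 0.8453 is 0.6745 multiplied by the square root of π/2 (the ratio of sigma to mean deviation for a normal distribution centred on zero) is standard statistics rather than something stated in the paper.

```python
import math


def pe_from_mean(tetrads):
    """Conventional observed pe: mean absolute tetrad-difference x 0.8453."""
    mean_abs = sum(abs(t) for t in tetrads) / len(tetrads)
    return 0.8453 * mean_abs


def pe_from_sigma(tetrads):
    """Observed pe from sigma: 0.6745 * sqrt(sum(t^2) / n)."""
    rms = math.sqrt(sum(t * t for t in tetrads) / len(tetrads))
    return 0.6745 * rms


# The constant 0.8453 agrees with 0.6745 * sqrt(pi / 2) to four places.
print(round(0.6745 * math.sqrt(math.pi / 2), 4))

# Reproducing the arithmetic of the summary from the quoted mean alone.
print(round(0.8453 * 0.0229, 4))
```

If the tetrad-differences were normally distributed about zero the two estimates would agree on average, so a gap between them (0.0194 against 0.0187 here) says something about the shape of the distribution.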

Standard “Normal” Scores.—Following the procedure already used
for the non-verbal subtests, the crude scores used for the correlations
of Table II were converted to the standard “normal” scores given
previously. Subtest No. 1 was not rescaled. After rescaling, all the
subtests have the same “normal” distribution of scores. The new


TABLE III.—PRODUCT-MOMENT CORRELATIONS FOR 1037 GIRLS: STANDARD
“NORMAL” SCORES; VERBAL SUBTESTS

          3      4      5      6      7      8

2       6013   5589   5985   5676   5845   5883
3       ....   4623   5924   5266   5509   5106
4              ....   4474   4593   5019   5046
5                     ....   5550   5745   5121
6                            ....   5697   5220
7                                   ....   5929





correlations for subtests 2 to 8 so rescaled are given in Table III, age 
being neglected. The mean intercorrelation is now 0.5424, compared 
with 0.5456 for the corresponding r’s of Table II. Table III provides
the following data: 


Mean of 105 tetrad-differences.......................... 0.0249
Observed pe (Mean × 0.8453)............................. 0.0211
Observed sigma ($0.6745\sqrt{\sum t^2/n}$)*............. 0.0202
Theoretical PE approximately............................ 0.0104


* Where $t$ stands for tetrad-difference, $n$ their number.


It is apparent, then, that the rescaling has not freed the table of 
the excess error observed for Table II. Our immediate object is 
to search for acceptable disturbances in these tetrads. Throughout
we shall make use of Table III, the correlations being there free from 
scaling anomalies. 
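As a rough illustration of how the Table III summary is obtained, the sketch below computes all tetrad-differences from the correlations as they are transcribed in this copy, with decimal points restored. This is an assumption-laden sketch: the digits come from a damaged printed table, so the result need not reproduce the published mean exactly, and the helper names are not the author's.

```python
from itertools import combinations

# Upper triangle of Table III as transcribed here (decimal points restored).
R = {
    (2, 3): 0.6013, (2, 4): 0.5589, (2, 5): 0.5985,
    (2, 6): 0.5676, (2, 7): 0.5845, (2, 8): 0.5883,
    (3, 4): 0.4623, (3, 5): 0.5924, (3, 6): 0.5266,
    (3, 7): 0.5509, (3, 8): 0.5106,
    (4, 5): 0.4474, (4, 6): 0.4593, (4, 7): 0.5019, (4, 8): 0.5046,
    (5, 6): 0.5550, (5, 7): 0.5745, (5, 8): 0.5121,
    (6, 7): 0.5697, (6, 8): 0.5220,
    (7, 8): 0.5929,
}


def r(i, j):
    """Look up a correlation regardless of index order."""
    return R[(min(i, j), max(i, j))]


def tetrad_differences(labels):
    """All tetrad-differences for the given subtests, three per set of four."""
    out = []
    for a, b, c, d in combinations(labels, 4):
        p1 = r(a, b) * r(c, d)
        p2 = r(a, c) * r(b, d)
        p3 = r(a, d) * r(b, c)
        out += [p1 - p2, p1 - p3, p2 - p3]
    return out


t = tetrad_differences(range(2, 9))
mean_abs = sum(abs(x) for x in t) / len(t)
print(len(t), round(mean_abs, 4), round(0.8453 * mean_abs, 4))
```

The sign of each difference depends on the arbitrary ordering of the subtests, which is why the summary works with absolute values.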


CONSIDERATION OF THE RESIDUAL ERROR 


Age is of but slight effect on the tetrads and, from our experience 
with the non-verbal subtests, we take the correlational calculations 
to be too accurate to allow of calculational mistakes being suggested 
as an explanation of the error obtained for the tetrads of Table III. 
What has been said in a previous paper concerning calculation mistakes
holds equally well here: There must be some error attributable to
calculational mistakes, but we take it to be slight; and the various 
correlations should be relied upon even if it is found that the tetrad- 
differences are slightly inaccurate, because the correlations receive 
most attention. The calculations for Table II were quite different, 
from beginning to end, from those for Table III: The crude scalings 
were satisfactory approximate “normal” distributions, so that the 
correlations for Table II should differ but little from those for Table 
III. As we see, the correlations are in fact but little different. Such 
considerations, and similar details to be observed in the course of our 
work, together with the help provided by instituting recheckings of
calculations at the instigation of the tetrads themselves (a correlation
associated with large tetrad-differences is first suspected of calculational
mistakes, and rechecking is made for that correlation), tend
to verify our acceptance of the various correlations as being reasonably 
free from calculational mistakes. 

We can isolate no single specificality in the case of Table III (or II), 
i.e., none showing uniquely, as in the case of the r,, ,,,; correlation or 
the non-verbal subtests.? We therefore proceed to examine Table III 
in the light of influences already known to us, so that, by eliminating 
some, a limited number may be left for final consideration. 

Similarity of Relations.—The Classification, Opposites, and Analogy
subtests (3, 5, and 6 respectively) involve very similar “likeness,”
or “unlikeness,” eduction—the classification test-units can be answered
in terms of “not-similar,” or “opposite,” the analogy test-units
involve both “opposites” and “similars,” while the opposites test-units
likewise sometimes may imply “similars.” Thus, for example,
the analogy test-unit is constructed of “opposites” in the following
sample:

   black: white:: dark: ?





Now these three subtests have been found by Davey to entail specificality
which, not unreasonably, was attributed to too great similarity
of the relations involved in the test-units. It is probable, then, that
the same effect enters into our subtests. We can free our data of such
an effect by omitting all tetrads involving r35, r36, and r56, leaving for
consideration the tetrads that cannot be disturbed by the probable
specificality. The process leaves 57 tetrads, with observed pe (conventional,
i.e., observed mean × 0.8453) of value 0.0181, the
theoretical PE being approximately 0.0104. Hence, even were specificality
acceptable for these correlations, the residual error for Table III
is just slightly reduced by eliminating the disturbance.
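The figure of 57 remaining tetrads can be checked by enumeration: a tetrad is kept only if none of its four correlations is one of the suspect ones among subtests 3, 5, and 6. The sketch below is one way to do that; the representation of a tetrad as a set of index pairs and the function names are assumptions made for illustration.

```python
from itertools import combinations


def tetrads(labels):
    """Yield each tetrad as the set of the four index pairs it involves."""
    for a, b, c, d in combinations(sorted(labels), 4):
        p1 = {(a, b), (c, d)}
        p2 = {(a, c), (b, d)}
        p3 = {(a, d), (b, c)}
        yield p1 | p2
        yield p1 | p3
        yield p2 | p3


def without(labels, excluded):
    """Tetrads that involve none of the excluded correlations."""
    return [t for t in tetrads(labels) if not (t & excluded)]


suspect = {(3, 5), (3, 6), (5, 6)}
remaining = without(range(2, 9), suspect)
print(len(remaining))  # 57
```

The same function applied to subtests 3, 4, 6, 7, and 8 with the single correlation between subtests 3 and 6 excluded leaves 9 of the 15 tetrads.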

“Idiosyncrasies.”—Of other possible disturbers of tetrads previous
mention has been made of test-constructor’s idiosyncrasies. Thus, a
battery of Thorndike CAVD subtests compared (by means of the 
tetrad technique) with, say, Spearman oral subtests, shows specificality 
in the one set of subtests relative to the other. The specificality may 
be due to the fact that the Spearman test-units are orally presented; 
or to scholastic influences in the Thorndike test-units (showing in 
Arithmetic, Word-ability, Understanding of paragraphs, etc.); or, 
indeed, to both or other influences. 

Now of the verbal subtests in Table III, Nos. 2 and 5 were largely 
Thorndike CAVD test-units (sentence-completion and opposites), 
while the others were of my own construction (although following well- 
known patterns). The tetrads for the five subtests of my construc- 
tion (3, 4, 6, 7, and 8) have value as follows: 


Mean of 15 tetrad-differences........................... 0.0148
Mean × 0.8453 (observed pe)............................. 0.0125
Theoretical PE approximately............................ 0.0105


It is of interest to note that the largest tetrads among these 15
are associated with the correlation r36 (considered above): Omission of
r36 gives 9 tetrads with value:


Mean of 9 tetrad-differences............................ 0.0131
Mean × 0.8453 (observed pe)............................. 0.0111
Theoretical PE approximately............................ 0.0105


(If the comparable correlations for crude scores are considered
the tetrads still show error of amount 0.015 when r36 is omitted: So
that if we accept the calculations, error of amount 0.015 appears to
arise from scaling anomalies, a result in accordance with conclusions
drawn in the previous paper.)

The subtests 2 and 5 may be disturbed by difficult or unfamiliar 
words; subtests 3, 6, 7, and 8 contain only what must be taken to be 
familiar words with “concrete” meaning (see the various samples 
already given) and verbal subtests less influenced by critical word- 
ability could scarcely be constructed. Subtest 4 involves somewhat 
complex sentences, with abstract structure. 

The omission of tetrads involving r25 would free the tetrads for 
Table III of any influence characterizing these two subtests alone. 
But residual error of amount 0.019 remains if r25 alone is omitted from
the tetrads for Table III, and if, in addition, we omit tetrads involving
r35, r36, and r56, so making allowance for possible “similarity of relations,”
this residual error becomes 0.018 (over and above sampling
error). Thus idiosyncrasy for subtests 2 and 5 provides no obvious 
specificality relative to the other subtests. 

The author’s work with various group tests—such as the Otis,
Spearman, National, Thorndike, etc.—has shown the potency of
influences entailed in different test-units constructed by one and the
same psychologist: Above, we suggest, is an example of the same
phenomenon. Had the author constructed the eight verbal subtests
himself (in passing we note that subtest No. 1 was taken from work by
Slocombe), he is of opinion that the eight might have shown agreement
with the tetrad criterion as satisfactory as that shown by 3, 4, 6, 7, and
8. Subtests 2 and 5, acting as “reference values,” show that disturbances
are latent in the five subtests.

“Speed Preference.”—An example of speed preference is given in the
previous paper, for subtests II and VIII. We found that the effect
could be controlled in terms of the errors made in subtest VIII. 

Subtests 2, 4, and 8 (and 1) were free from errors: Each test-unit 
was either fully correct or not attempted. Subtests 3, 5, 6, and 7, 
show frequent errors. Sheer guessing, the scores indicated, was rare. 
We judge introspectively, and from the prevalence of mistakes made 
by the testees, that subtests 3, 6, and 7, are most likely to entail a 
“speed” effect. Subtest 5 is ostensibly similar to 3, 6, and 7, and may 
be judged to entail “speed preference”; but the test-units in subtest 5 
have qualities that do not lend themselves to “speed preference.” 
Furthermore, r35 and r56 have received prior notice in terms of similarity
of relations, and the inclusion of these correlations in the tetrads
about to be considered is open to criticism on that account. The 






















262 The Journal of Educational Psychology 


influence of “speed” will be most noticeable in tetrads of the type

$$r_{s_1 s_2} \cdot r_{q_1 q_2} - r_{s_1 q_1} \cdot r_{s_2 q_2} = t \qquad (1)$$

(s stands for a subtest involving “speed preference,” such as 3, 6, and
7; and q for a subtest taken to be satisfactory for the speed-quality
functioning).

There are 36 such tetrads for the s subtests 3, 6, 7, and 5, taken 
together with the q subtests 2, 4, and 8. These have observed pe
0.028—quite a significant result. Omitting these 36 tetrads from the 
full 105 for Table III leaves 69 tetrads, with value 


Mean of 69 tetrad-differences........................... 0.0206
Mean × 0.8453 (conventional pe)......................... 0.0174
Theoretical PE approximately............................ 0.0105
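The counts of tetrads of type (1) can be reproduced by pairing two “speed” subtests with two reference subtests and setting the product of the two within-group correlations against each of the two mixed products. A minimal sketch under that reading of type (1) follows; the function name and the tuple layout are illustrative.

```python
from itertools import combinations


def speed_type_tetrads(s_tests, q_tests):
    """Tetrads r_s1s2 * r_q1q2 - r_s?q? * r_s?q?, as lists of index pairs."""
    out = []
    for s1, s2 in combinations(s_tests, 2):
        for q1, q2 in combinations(q_tests, 2):
            # The within-group product is compared with each mixed product.
            out.append(((s1, s2), (q1, q2), (s1, q1), (s2, q2)))
            out.append(((s1, s2), (q1, q2), (s1, q2), (s2, q1)))
    return out


with_5 = speed_type_tetrads([3, 5, 6, 7], [2, 4, 8])
without_5 = speed_type_tetrads([3, 6, 7], [2, 4, 8])
print(len(with_5), len(without_5), 105 - len(with_5))
```

The three numbers printed correspond to the 36 tetrads with subtest 5 counted as a “speed” subtest, the 18 without it, and the 69 left in Table III once the 36 are set aside.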


But if subtest 5 is not taken to be “speed” biased or, if prior regard
is taken of the similarity of relations noted for r35 and r56 (and probably
r36), then the mean of 18 tetrads of form (1) for s subtests 3, 6, 7,
and q subtests 2, 4, 8, is 0.024: And elimination of these leaves the
tetrads for Table III undiminished. We see, then, very little clear
evidence that “speed preference” has entered significantly in our
subtests, judged, that is, among themselves. The evidence, of course,
might be otherwise if different “reference values” are employed.

“Propinquity” Influences.—These influences may be attributed to
objective or subjective fatigue (especially when the testing time is
lengthy, taking two hours or so), to “habituation,” and to “end-spurt.”
Calculation mistakes may be suspected too, especially in r12, which
is usually the first correlation calculated. The influences show in
tetrads of the type:

$$r_{12} \cdot r_{78} - r_{17} \cdot r_{28} = t$$


Control of such influences was attempted in our work. Subtest 1 
has been omitted from calculations partly because it served as an 
“habituation,” or “shock,” absorber; the testing time was reasonably
short; subtest 7 was quite unlike subtest 8 in routine requirements, so
tending to break any “set” toward extra effort in the last two subtests;
every endeavour was made to dispel initial indifference, apprehension,
or “speed preference.”

The observed conventional pe of all tetrads involving r78 is 0.0244.
But the largest of these tetrads have r35 and r56 likewise on the left-hand
side of the above type of equations, and omission of tetrads
involving these correlations leaves 16 tetrads, now with pe 0.0205.
Omission of all tetrads involving r78, r35, and r56, leaves 51 for Table III,
with pe 0.0168. However, what little evidence may be got from these
results for r78 is offset by the fact that no disturbance is apparent for
r78 when the author’s subtests alone are considered (see section on
Idiosyncrasies). The data thus give no indication of disturbances 
attributable to propinquity effects. 

The Influence of Testing by Groups.—This has been suggested by
Professor Spearman as a likely source of error in tetrads. In the case
of the present verbal material a check of the influence can be made by 
calculating separate correlations for each group of girls tested, averag- 
ing the correlations so obtained. For our 1037 girls we had 21 testing 
groups. The calculation mistakes entailed in working out such a 
large number of correlations might outweigh the error to be expected 
as attributable to group testing. It is possible that, by calculating 
separate correlations for each testing group, we might arrive at inter- 
correlations giving sampling error values to the tetrads for Table III. 
But we would be concerned with the slight specificalities under con- 
sideration for Table III, the whole being relative to the verbal subtests 
themselves. 

We note that no error need be (nor can be) attributed to group 
testing anomalies in the case of the non-verbal subtests: And the 
largest error available in the case of the verbal subtests can be of the 
order 0.015 only. In the case of the five subtests constructed by 
the author, 3, 4, 6, 7, and 8, there need be no error attributable to 
group testing anomalies.

But there can be no doubt that group testing, on occasion, may 
introduce large disturbances in tetrads. Thus, correlations may be 
much influenced if one class, having been taught geometry, is given a 
test involving geometry, while other classes, given the same test, have 
received no such instruction in geometry. It may be taken, we 
suggest, that our non-verbal subtests would be most open to influences 
of this kind (consider, for instance, the subtests III, VI, and VIII as 
given in the previous article). Except in terms of vocabulary
instruction, such influences cannot readily be taken to enter the verbal 
subtests. Previous practice with similar subtests would not cause 
disturbances in the tetrads, if all the subtests received the practice, 
or, in effect, received the practice. Group testing may introduce 
disturbances in other ways; for example by mistakes in timing subtests, 
or through faults in testing technique for particular subtests for some 
testing groups. All the testing was done by the author, and suggestions
of timing or test application or marking faults can be
discountenanced.

A procedure to be described in the next section might be expected 
to negate broad influences that may arise because of group testing: 
The procedure leads to no material diminution of the error in the 
tetrads for Table III.

The Influence of School and Class.—As has been reported in the
previous paper, the girls for our work were drawn from eleven schools,
and from Standards IVA to VIIB in these schools (see Table II of that paper).
Certain schools were by repute of better standing than others. We 
could determine intercorrelations for the subtests and reputed school 
standing. Similarly, a point scale could be made for school standard. 
The influence of school and school standard could then be partialled 
out. Instead of making two sets of correlations necessary, our 
purpose can be served by combining both reputed standing and school 
standard (class) into one measure. A 16-point scale was devised 
(hereafter called the C-measure), the foundation being the order of 
class (IVA to VIIB) within which girls of classes with high reputation 
were accommodated. The various Standards IV were scaled 0, 1, 2,
etc., where the reputed “poorest” school was scaled 0. The score
was the same for every girl in a particular class: And it depended upon 
opinion given by teachers, upon scholarship returns, and, objectively, 
upon the school class. We believe the C-measure to be a satisfactory 
measure of the relative schooling and environmental influences 
exerted on the girls. 

To Table III we can now add the following correlations of the 
subtests 2 to 8 with the C-measure: 0.4598, 0.4291, 0.3130, 0.3659, 
0.4163, 0.4451, 0.4110, respectively. 

The partial correlations, with the C-measure partialled out, were
next calculated, and the correlations provided tetrads with the following
value:


Mean of 105 tetrad-differences, Table III, with C-measure
   partialled out....................................... 0.0253
Observed pe (Mean × 0.8453)............................. 0.0214
Theoretical PE approximately............................ 0.0105
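The paper does not print the formula used for partialling out the C-measure. The sketch below assumes the standard first-order partial correlation and applies it to one pair, subtests 2 and 3, using r23 from Table III and the first two C-measure correlations quoted above. The function name is illustrative.

```python
import math


def partial_r(r_ab, r_ac, r_bc):
    """First-order partial correlation of a and b with c held constant.

    Assumed here to be the formula behind the paper's 'partialled out' values.
    """
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


r23 = 0.6013  # Table III, subtests 2 and 3
r2c = 0.4598  # subtest 2 with the C-measure
r3c = 0.4291  # subtest 3 with the C-measure

r23_c = partial_r(r23, r2c, r3c)
print(round(r23_c, 4))
```

Under this assumption the partialled correlation for subtests 2 and 3 comes out near 0.50, against 0.6013 before partialling. Repeating the calculation for all 21 pairs gives the matrix from which the 105 tetrad-differences of the summary above would be formed.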


Furthermore, we can try out all the various possible influences 
discussed in the previous sections, using the partial correlations 
instead of those given in Table III. In all cases the conclusions 
arrived at are in no way different from those arrived at for Table III.
The disposition of error in the tetrads still shows r35 and r56 as sources
of large differences; a try out of the theory of “speed preferences”
gives much the same results; and the five subtests constructed by the 
author again have good agreement with the tetrad criterion. Thus, 
it is not in terms of a C-measure that residual error in the tetrads for 
the verbal subtests can be explained. 

The Influence of “Verbality.”—Ostensible characteristics of the
verbal subtests are as follows:


(a) Subtests 2, 4, and 8, involve phrases and sentences or short paragraphs. 

(b) Subtest 5 (and 1 to a lesser degree) critically entails difficult words—for 
instance, test-units of this type: 

“airy uniform blithe diversified” 

(c) Subtests 3, 6, and 7, make use of simple words, all well within easy
understanding of the girls tested (i.e., according to their g-ability—we do not suggest
that the dullest girl tested had knowledge of all the words in these subtests).
(It is this easiness of vocabulary, and simplicity of the relations involved, that
makes subtests 3, 6, and 7 particularly open to “speed” influences.)

(d) Subtest 8 also involves easy words, equally understandable by the girls,
but we cannot readily take it to be “speed” biased.

(e) Subtests 2 and 4 are not so critical for vocabulary as is subtest 5. Subtests
2 and 4, however, are superficially much alike—both are essentially
“sentence completion.”

(f) A generalization from influences of the kind just passed in review ((a) 
to (e)) is got in terms of the fact that for any given fundaments—of words, or 
phrases, or other linguistic structure—a critical matter may be (1) either that the 
word or phrase, etc. has to recall meaning, or (2) that meaning (already educed) 
has to recall words or phrases, etc. Thus, reproduction, recall of a word or phrase
or meaning, may be critical in some test-units: But eduction, not reproduction,
characterizes the universal g-factor, and critical reproduction may thus possibly
act as a disturber of tetrads.


What, then, have our data to offer concerning the above characteristics
and influences? Obviously it will be of greatest value to refer
the verbal material to the non-verbal subtests, keeping in mind the 
above possible influences. This, however, is to be our concern later. 
For the present we note the following facts: 

Tetrads of type (1) for s as subtests 3, 6, and 7 (taken two at a
time) and q as subtests 2, 4, 5, and 8 (two at a time), have observed
conventional pe 0.0201, while the observed pe for all the other tetrads
of the table is 0.0216. If we allow for disturbances attributed
to “similarity of relations” the observed pe for tetrads of type (1)
becomes 0.0174, compared with 0.0178 for the rest of the tetrads of
the table. There is no evidence, then, for specificality in subtests
2, 4, 5, and 8, relative to 3, 6, and 7; i.e., nothing is observed of the
nature of a “speed preference” for subtests 3, 6, and 7; and nothing in
terms of “verbal ability” in subtests 2, 4, 5, and 8 (for difficult vocabulary
and understanding of connected discourse). This conclusion, of
course, is relative to the verbal subtests amongst themselves. The 
results do not seem to promise assistance by offering explanations of 
the excess error obtained in the tetrads for Table III. 

In terms of the verbal subtests alone, it would be making too 
much of our material to enter into the possible influences of reproduc- 
tion on the tetrads. Should a sufficiently broad specificality be 
observed in the verbal subtests relative to the non-verbal subtests, 
which receives no acceptable explanation in terms of influences such
as those of “speed preference,” “idiosyncrasies,” “group testing,”
and the like, then a most important source of disturbances of tetrads
might well be reproduction effects. All this, however, is a matter for 
future consideration. 


RÉSUMÉ AND CONCLUSIONS


We have applied the Theory of Two Factors to the intercorrelations 
for Table III, and find that the observed error in the tetrads is slightly 
in excess of that expected from sampling error alone. 

Previous work confined to verbal subtests has shown some agreement
with the tetrad-difference criterion, but most of the previous data
are for small populations only, where the sampling error is large in 
comparison with other errors, so swamping them. When, however, 
the sampling error is small (as in our work) some observed excess 
error, of the kind that we have found, is to be anticipated and, indeed, 
is demanded by the Theory of Two Factors. The observed excess 
error for Table III, then, is not incompatible with the Theory. But 
the elimination of such small excess errors would only slightly reduce 
the g-saturation of the correlations. 

Our problem has been that of seeking acceptable explanations for 
the excess observed error in the tetrads for Table III. We have 
applied to our data certain theories of disturbances of tetrads that have 
been found of explanatory value in previous work, trying, in particular, 
the influences of age effects, calculation mistakes, “speed preferences,”
“propinquity,” tester’s idiosyncrasies, school and class, group testing
anomalies, and “similarity of relations.” Of the various suggested
disturbers of tetrads, that of “similarity of relations” for the Analogy,
Opposites, and Classification subtests appears to be the most discernible,
but it is not at the root of the major part of the observed
excess error. “Speed preference,” age, school and class, and superficial
verbal likenesses of the subtests, give no clear indication of being 
appreciably contributory to the excess error. We are in some doubt 
about the magnitude of error attributable to group testing anomalies. 
Error is perhaps likely, but the non-verbal subtests gave no indication
of being so influenced, and it is perhaps among these subtests that 
we might most expect such group testing effects to be disturbers of 
tetrads. We find, then, that none of these suggested probable dis- 
turbers singly can be taken to account fully for the observed excess 
error in the tetrads for Table III. The sources of error so far brought 
forward appear to be inadequate to explain the whole of the excess 
error. 

We might accept, provisionally, a theory that a sum of many 
small disturbances of the kind passed in view in the course of our 
paper may be taken to explain the observed excess error. This 
would mean that elimination of all the suggested small disturbances 
would but slightly reduce the g-saturation of the correlations. For 
further light on this matter, however, we must look to comparison of 
the verbal and non-verbal subtests, a study that we are to report in the 
next paper. 


FOUR TYPES OF EXAMINATIONS COMPARED AND
EVALUATED 


ALVIN C. EURICH 


University of Minnesota 


Since objective examinations are used so extensively to measure 
achievement in college courses, it may no longer be necessary to refer 
to them as new-type examinations. This does not imply that such 
measures of accomplishment are fully evaluated nor does it even 
suggest that diligent scrutiny of their results has become anachronistic. 
In fact the greater use of these examinations makes imperative even 
more extended studies of their relative validity and reliability as well 
as the inter-relationship that exists between them. Such investiga- 
tions should be conducted in a rather wide variety of specialized 
courses now taught in colleges and universities. The well-known 
work of Wood¹ at Columbia University has done much to stimulate
a widespread interest in evaluating various types of achievement 
examinations that are now in vogue. The particular investigation 
reported within these pages is an outgrowth of that interest. Its 
purpose is to evaluate the essay, completion, multiple-choice, and true- 
false examinations when each type covers exactly the same subject- 
matter. 


METHOD


As far as can be determined the unique feature of this investigation 
rests in the method employed to evaluate the particular types of 
examinations used. In constructing the tests a traditional essay-type 
examination was first prepared to cover the salient points of the 
subject-matter. Following the formulation of the questions the 
answers were written out in detail and the total number of items to be 
included in each question was thereby determined. The next step 
in the procedure was to construct a completion examination covering 
exactly the same material as that embodied in the essay type. To be 
assured of this, the completion statements were made up from the 
answers to the essay questions. A multiple-choice examination was 
next constructed in the same manner; and following this, a true-false 
examination. Thus the completion, multiple-choice, and true-false
tests were each derived from the correct responses to the essay-type
questions.

1 Wood, Ben D.: “Measurement in Higher Education.” World Book Co.,
1923.

This procedure was followed in two experiments at the University 
of Minnesota: 

The first of these was conducted with a group of students enrolled 
in a course in statistical methods during the first term of the summer 
session of 1927. In the class, there were one hundred seven seniors
and graduate students of the College of Education, ninety-nine of 
whom were present on the day the tests were administered. 

In the second experiment, the group consisted of one hundred six 
juniors and seniors enrolled in educational psychology during the first
term of the summer session of 1929. 

In both cases the only knowledge which the members of the class 
had concerning the experiment was that they were to take a two-hour 
midterm examination. They were told that this examination was to 
consist of four parts: Part I, an essay examination; Part II, a comple- 
tion examination; Part III, a multiple-choice examination; and Part 
IV, a true-false examination. They were further informed that the 
examination would cover all the material in the course up to the middle 
of the term. 

The total number of points in each type of test and the amount 
of testing time allotted are included in Table I for the first experiment 
and Table II for the second. The examinations used in Experiment 
II were somewhat longer than those in Experiment I although the 
amount of time allotted to each was greater only in the case of the 
true-false test. The experience with the first investigation served 
as a basis for this adjustment. The tests are listed in the tables in 
the same order as they were given. 


TABLE I.—TESTING TIME, TOTAL POINTS, MEANS, AND VARIABILITY OF SCORES
ON FOUR TYPES OF TESTS USED IN EXPERIMENT I. N = 99

                          Testing
                          time in    Total    Range of
Test                      minutes    points   scores      Mean      PE       SD       PE       CV

Essay                        35        69      20-61      39.27   ± .62     9.14   ± .44    23.27
Completion                   30        94      25-78      49.16   ± .78    11.44   ± .55    23.27
Multiple-choice              20        39      11-38      29.41   ± .33     4.91   ± .24    16.70
True-false                   15        50      28-46      39.35   ± .23     3.34   ± .16     8.49
Composite of four tests     100       252     109-217    154.89   ±1.57    23.20   ±1.11    14.98


The scoring of the essay test was somewhat more objective than
is usually the case. Each student’s paper was compared with the 
complete set of answers previously worked out, and only those points 
which appeared on the master copy were given credit. For the comple- 
tion and multiple-choice examinations, the number of correct responses 
was considered as the score. On the true-false test the score was 
determined by subtracting the number of wrong responses from the 
number right. 
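
The two scoring rules just described can be sketched in a few lines of Python. This is only an illustration: the function names, the answer key, and the treatment of omitted items (counted as neither right nor wrong) are assumptions, since the text does not say how omissions were handled.

```python
def score_objective(responses, key):
    """Completion / multiple-choice rule: the score is the number of correct responses."""
    return sum(1 for given, correct in zip(responses, key) if given == correct)


def score_true_false(responses, key):
    """True-false rule: number right minus number wrong.

    Assumption: an omitted item (None) is counted as neither right nor wrong.
    """
    rights = sum(1 for g, c in zip(responses, key) if g is not None and g == c)
    wrongs = sum(1 for g, c in zip(responses, key) if g is not None and g != c)
    return rights - wrongs


# Hypothetical five-item true-false key and one student's answers.
key = [True, False, True, True, False]
answers = [True, False, False, None, False]

print(score_true_false(answers, key))  # 3 right, 1 wrong, 1 omitted -> 2
```

Subtracting wrongs from rights gives an expected score of zero to a student who guesses at random on every item, which is presumably the reason for the rule.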


TABLE II.—TESTING TIME, TOTAL POINTS, MEANS AND VARIABILITY OF SCORES
ON FOUR TYPES OF TESTS USED IN EXPERIMENT II. N = 106

                          Testing
                          time in    Total    Range of
Test                      minutes    points   scores      Mean      PE       SD       PE       CV

Essay                        35       118      12-76      40.80   ± .80    12.15   ± .56    29.78
Completion                   30        96      16-79      55.28   ± .84    12.90   ± .60    23.34
Multiple-choice              20        69      29-62      48.44   ± .48     7.35   ± .34    15.17
True-false                   20        75       5-61      39.25   ± .75    11.40   ± .53    29.04
Composite of four tests     105       358      62-258    171.33   ±2.37    36.20   ±1.68    21.13


RESULTS


In the first experiment the four tests varied in their differentiating 
capacity as shown in Table I. An examination of the figures in the 
appropriate column reveals the fact that the standard deviations on 
the essay and completion tests are larger than on the multiple-choice 
and true-false examinations. The same is true of the coefficients of 
variability. This difference is without question due to the nature of 
the particular examinations used rather than due to the type of test. 
The variability measures for Experiment II (Table II) show that the 
essay, completion, and true-false examinations appear to differentiate 
the students to approximately the same extent. Had the multiple-choice
test been made longer it is very probable that it would have
differentiated the students equally well. 
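
The variability measures in Tables I and II can be checked with a short Python sketch. The paper does not state its formulas; the ones below are the conventional ones (CV as 100 times SD over mean, and probable error as 0.6745 times the standard error), adopted here as an assumption. They do reproduce the essay row of Table I.

```python
import math

PE_FACTOR = 0.6745  # probable error = 0.6745 x standard error (assumed convention)


def coefficient_of_variability(mean, sd):
    """CV: the standard deviation expressed as a percentage of the mean."""
    return 100.0 * sd / mean


def pe_of_mean(sd, n):
    """Probable error of a mean based on n scores."""
    return PE_FACTOR * sd / math.sqrt(n)


def pe_of_sd(sd, n):
    """Probable error of a standard deviation based on n scores."""
    return PE_FACTOR * sd / math.sqrt(2 * n)


# Essay test, Experiment I (Table I): mean 39.27, SD 9.14, N = 99.
print(round(coefficient_of_variability(39.27, 9.14), 2))  # 23.27
print(round(pe_of_mean(9.14, 99), 2))                     # 0.62
print(round(pe_of_sd(9.14, 99), 2))                       # 0.44
```

Because the CV divides by the mean, it allows tests with different numbers of points to be compared, which the raw standard deviations do not.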

In order to determine the extent to which the four types of tests
measured the same degree of achievement, their intercorrelations
were found. The coefficients for both experiments are given in Table
III. Considering the tests as given in the first experiment it is evident
that the highest intercorrelation appears between the completion
and multiple-choice types (r = .57). The lowest exists between the
essay and true-false examinations (r = .30). When corrected for
attenuation, the multiple-choice test correlates about as highly
with the completion (.80) as with the true-false type (.81). In the
second experiment all the intercorrelations are higher. The coefficient
of greatest magnitude is found for the essay and completion
tests (.80). The lowest correlation obtains between the essay and
true-false tests (.55) as in the first experiment. When corrected
for attenuation the coefficients in Experiment II range from .85 to
1.20.¹ These high coefficients indicate that if a reliable test is constructed
one type is probably as adequate for measuring the acquisition
of information in a course as any of the other three types. The
fact that the intercorrelations are lower in the first experiment than
in the second is probably due to the lower validity and reliability of
the first set of tests in comparison with the second, although the
subject-matter of the tests may also be a factor.


TABLE III.—INTERCORRELATIONS OF TESTS

                                       Experiment I                 Experiment II
                                                 Corrected                    Corrected
                                                 for                          for
Tests                               r      PE    attenuation     r      PE    attenuation

Essay and completion               .44    ±.05      .62         .80    ±.02     1.20
Essay and multiple-choice          .47    ±.05      .67         .63    ±.04      .97
Essay and true-false               .30    ±.06      .66         .55    ±.05      .89
Completion and multiple-choice     .57    ±.05      .80         .71    ±.03      .92
Completion and true-false          .37    ±.06      .68         .63    ±.04      .85
Multiple-choice and true-false     .44    ±.05      .81         .65    ±.04      .90
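
The correction for attenuation is not written out in the paper. A minimal Python sketch, assuming the usual Spearman form (the observed correlation divided by the square root of the product of the two reliabilities), reproduces the essay-completion entries of Table III from the reliabilities reported in Table V. The function name is illustrative only.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between two tests freed of measurement error.

    r_xy: observed correlation between the two tests
    r_xx, r_yy: reliability coefficients of the two tests
    """
    return r_xy / math.sqrt(r_xx * r_yy)


# Essay and completion, Experiment I: r = .44, reliabilities .69 and .72.
print(f"{correct_for_attenuation(0.44, 0.69, 0.72):.2f}")  # 0.62

# Essay and completion, Experiment II: r = .80, reliabilities .56 and .80.
print(f"{correct_for_attenuation(0.80, 0.56, 0.80):.2f}")  # 1.20
```

The second result shows how a corrected coefficient can exceed 1.00: when the observed correlation is larger than the geometric mean of the two reliabilities, the quotient passes unity.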

1 A corrected correlation that is greater than 1.00 must be interpreted to mean
that the relationship is near unity.

To evaluate these facts further, correlation coefficients were
calculated for the scores on each test and composite scores on the other
three. For example, the essay test was correlated with the composite
score of the multiple-choice, true-false and completion examinations.
The same procedure was followed for each of the other tests. These
correlations, which may be called validity coefficients, have been
placed in Table IV. The values for the tests as used in the first
experiment appear in the second column as follows: For the essay, .64;
for the completion, .68; for the multiple-choice, .66; and for the
true-false, .40. It was further sought to determine the validity of the
tests in case each of the four types was made of equal length; that is,
in case the testing time for each was sixty minutes. Thus, the validity
coefficients are estimated for the essay test when less than doubled in
length, for the completion test when doubled, for the multiple-choice
test when tripled, and for the true-false test when quadrupled. To do
this the formula for determining the validity of a lengthened test as
given by Holzinger¹ was used. The coefficients obtained in this
manner are as follows: Essay, .69; completion, .73; multiple-choice,
.74; and true-false, .54.


TABLE IV.—THE RELATION OF EACH TYPE OF TEST TO THE COMPOSITE OF THE
OTHER THREE TYPES

                           Experiment I                        Experiment II
                                        If made sixty                       If made sixty
                    As constructed      minutes in      As constructed      minutes in
                                        length                              length
Tests                 r       PE            r             r       PE            r

Essay                .64     ±.04          .69           .76     ±.03          .84
Completion           .68     ±.04          .73           .82     ±.02          .86
Multiple-choice      .66     ±.04          .74           .77     ±.03          .84
True-false           .40     ±.06          .54           .75     ±.03          .84

Estimated validity of composite of four tests: Experiment I, .70; Experiment II, .86
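
The paper cites Holzinger for the lengthening formula without printing it. The Python sketch below assumes the standard expression for the validity of a test lengthened n times against a fixed criterion; whether this is exactly the form on the cited page is an assumption. With the reliabilities from Table V it reproduces the essay and completion estimates for Experiment I.

```python
import math


def validity_if_lengthened(r_xc, r_xx, n):
    """Estimated validity of a test made n times as long.

    r_xc: validity of the test as constructed (correlation with the criterion)
    r_xx: reliability of the test as constructed
    n:    ratio of the new length to the original length
    """
    return r_xc * math.sqrt(n) / math.sqrt(1 + (n - 1) * r_xx)


# Essay, Experiment I: 35 minutes lengthened to 60, validity .64, reliability .69.
print(f"{validity_if_lengthened(0.64, 0.69, 60 / 35):.2f}")  # 0.69

# Completion, Experiment I: 30 minutes lengthened to 60, validity .68, reliability .72.
print(f"{validity_if_lengthened(0.68, 0.72, 60 / 30):.2f}")  # 0.73
```

Taking n as the ratio of testing times assumes that a test lengthened in time is lengthened in items in the same proportion.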

The validity coefficients for the tests used in Experiment II were 
likewise calculated. These are also inserted in Table IV. For the
essay test as constructed, the validity coefficient is .76; for the com- 
pletion, .82; for the multiple-choice, .77; and for the true-false, .75. 
When estimated for the tests if made of equal length, the coefficients 
are all of approximately the same magnitude. They indicate, there- 
fore, that the four types of tests used in these experiments have 
approximately the same validity (providing the criterion is an adequate 
one). The surprising implication of this fact is that the essay type of 
examination, when scored objectively, appears to be as valid as the 
other so-called objective examinations. 





1 Holzinger, Karl J.: “Statistical Methods for Students in Education.” P. 170. 
The reliability coefficients presented in Table V were found by
securing the Pearsonian r between the odd and even items of the tests 
and estimating with the Spearman-Brown formula. In both of the 
experiments the completion and multiple-choice examinations are 
most reliable. In the first, the true-false test has the lowest reliability 
whereas the essay examination is least reliable in the second. When 
the reliability coefficients are estimated for tests of equal length, the 
completion and multiple-choice types retain their positions as the most
reliable tests in both experiments. The reliability of the essay exami- 
nation in the first experiment does not differ markedly from these 
whereas in the second experiment, the essay examination appears to be 
considerably less reliable. The reliability coefficient for the composite 
of the four tests of each experiment was determined by an adaptation 
of Spearman’s formula for the correlation between sums or averages.¹
The coefficients show that the two batteries of tests have approximately 
equal reliability. 
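
The odd-even procedure and the Spearman-Brown step-up can be sketched as follows. This is a minimal Python illustration, not the author's computation: the helper names are hypothetical, and only the two figures checked against Table V come from the paper.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r, n):
    """Reliability of a test made n times as long, given its present reliability r."""
    return n * r / (1 + (n - 1) * r)


def split_half_reliability(item_scores):
    """Correlate odd-item and even-item totals, then step up to full length.

    item_scores: one list of item scores per student, items in test order.
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals), 2)


# Table V, Experiment I: estimates for tests lengthened to sixty minutes.
print(f"{spearman_brown(0.72, 60 / 30):.2f}")  # completion: 0.84
print(f"{spearman_brown(0.71, 60 / 20):.2f}")  # multiple-choice: 0.88
```

The same function serves both purposes in the paper: with n = 2 it converts a half-test correlation into a full-test reliability, and with n equal to the ratio of testing times it estimates the reliability at sixty minutes.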


TABLE V.—RELIABILITY OF TESTS

                               Experiment I                        Experiment II
                                         If made sixty                       If made sixty
                      As constructed     minutes in       As constructed     minutes in
                                         length                              length
Tests                   r      PE          r      PE        r      PE          r      PE

Essay                  .69    ±.04        .79    ±.03      .56    ±.06        .69    ±.04
Completion             .72    ±.04        .84    ±.02      .80    ±.03        .89    ±.02
Multiple-choice        .71    ±.04        .88    ±.02      .75    ±.03        .90    ±.01
True-false             [?]    ±.08        .74    ±.05      .69    ±.04        .87    ±.02

Composite of four tests: Experiment I, [?]; Experiment II, .89

[?] Figure illegible in the source.














Another matter of concern was the relationship of these various 
types of examinations to intelligence. The only intelligence test scores 
available for the group of students enrolled in statistical methods were 
those on the Miller Mental Ability Test, Form A. The coefficients 
of correlation between this test and the various types of achievement 
examinations appear in Table VI. The zero order coefficients show 
that the highest degree of relationship exists between Miller A test and 
the completion examination (r = .53), the lowest between the Miller A
and the essay examination (r = .33). When corrected for attenuation,
the coefficients of correlation of the Miller Mental Ability Test
with the completion (.68) and with the true-false examination (.67)
are approximately equal, whereas the lowest relationship is found with
the essay examination.

1 Kelley, T. L.: “Statistical Method.” P. 198; Douglas, H. R. and Cozens,
F. W.: On Formula for Estimating the Reliability of Test Batteries. Journal of
Educational Psychology, Vol. XX, 1929, pp. 369-377.

2 Miller, W. S.: “Mental Ability Test.” World Book Company.


TABLE VI.—RELATION OF TESTS TO INTELLIGENCE

                                    Experiment I                         Experiment II
                             Miller Mental Ability Test           Miller Hard Analogies Test
                                           Corrected for                        Corrected for
Tests                         r      PE    attenuation¹            r      PE    attenuation¹

Essay                        .33    ±.06       .43                .34    [?]        .48
Completion                   .53    ±.05       .68                .44    [?]        .52
Multiple-choice              .42    [?]        .54                .53    [?]        .65
True-false                   .40    ±.06       .67                .55    [?]        .70
Composite of four tests      .49    ±.05       .57                .53    [?]        .59

1 The self correlation for the Miller A test is .86, and for the Analogies test,
.90.

[?] Figure illegible in the source.


In the educational psychology class the Miller Hard Analogies
Test¹ was used as a measure of intelligence. With this test, which
is better adapted to the college group than the Miller Mental Ability 
Test, the relationship between the essay examination and intelligence 
is also the lowest of those obtained. It would seem, therefore, that 
students with relatively good intelligence are less apt to secure high 
scores on the essay examination than on the other three. This may
be due to the fact that the form of the essay type is much less like the 
form of intelligence tests than are the completion, multiple-choice, and
true-false types.

After the students completed all four parts of the examination,
they were asked to write out their order of preference for these types.
These data are summarized in Table VII for Experiment I and in
Table VIII for Experiment II. In both classes the multiple-choice
examination is placed first by a larger percentage of students than
any of the other types. The true-false is placed second. In the first
experiment the plurality gives the essay examination the third position
and the completion examination the fourth. In Experiment II,
however, the plurality gives the completion the third position and the
essay, the fourth. It is clear that the multiple-choice and true-false
examinations are preferred by the students to the essay and completion
types.

1 Not commercially available. Prepared by W. S. Miller.


TABLE VII.—STUDENT PREFERENCES FOR DIFFERENT TYPES OF TESTS
Experiment I

                                  Type of examination
               Essay           Completion       Multiple-choice      True-false
Choice       N   Per cent     N   Per cent       N   Per cent       N   Per cent

First       14     14.9       6      6.5        40     43.0        39     40.6
Second      14     14.9       5      5.4        34     36.6        39     40.6
Third       44     46.8      22     23.9        15     16.1        11     11.5
Fourth      22     23.4      59     64.1         4      4.3         7      7.3
Totals      94    100.0      92     99.9        93    100.0        96    100.0





























TABLE VIII.—STUDENT PREFERENCES FOR DIFFERENT TYPES OF TESTS
Experiment II

                                  Type of examination
               Essay           Completion       Multiple-choice      True-false
Choice       N   Per cent     N   Per cent       N   Per cent       N   Per cent

First       11     11.5      13     13.4        58     59.8        18     18.4
Second       9      9.4      16     16.5        27     27.8        44     44.9
Third       27     28.1      38     39.2         9      9.3        22     22.5
Fourth      49     51.0      30     30.9         3      3.1        14     14.3
Totals      96    100.0      97    100.0        97    100.0        98    100.1
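

The percentages and "plurality" positions in the preference tables follow directly from the counts. A small Python sketch, using the Experiment II counts of Table VIII (the function names are illustrative, not from the paper):

```python
RANKS = ["first", "second", "third", "fourth"]


def percent_distribution(counts):
    """Convert counts for first..fourth choice into percentages of the column total."""
    total = sum(counts)
    return [round(100.0 * c / total, 1) for c in counts]


def plurality_rank(counts):
    """Return the position that the largest number of students assigned to a test type."""
    return RANKS[counts.index(max(counts))]


# Counts for first, second, third, fourth choice (Table VIII, Experiment II).
table_viii = {
    "essay": [11, 9, 27, 49],
    "completion": [13, 16, 38, 30],
    "multiple-choice": [58, 27, 9, 3],
    "true-false": [18, 44, 22, 14],
}

print(percent_distribution(table_viii["multiple-choice"]))  # [59.8, 27.8, 9.3, 3.1]
for test_type, counts in table_viii.items():
    print(test_type, plurality_rank(counts))
```

The column totals differ slightly (96 to 98) because not every student ranked every type, so each percentage is taken on its own column total.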


























Further analysis was made of these data by tabulating the order
of preference for the fifteen individuals securing the highest score and
for the fifteen securing the lowest score on the intelligence tests.
Table IX presents the figures for these selected groups in the statistical
methods class while Table X presents comparable data from the
second experiment. Although no generalizations can be derived
from these two tables, it appears that both those of low and high
intelligence prefer the multiple-choice and true-false tests more than
they do the other types. In the first experiment there is a tendency
for those lowest in intelligence scores to prefer the essay examination
more than do those who received the highest ratings. This tendency is
not as clear in the results from the second experiment. In both classes,
there is a slight tendency for those with the highest intelligence ratings
to prefer the completion examination more than do those individuals
who rank lowest.


TABLE IX.—TEST PREFERENCES FOR THE FIFTEEN HIGHEST RANKING STUDENTS
AND FOR THE FIFTEEN LOWEST RANKING STUDENTS IN INTELLIGENCE
Experiment I

                                     Type of examination
                Essay             Completion        Multiple-choice       True-false
Choice      Highest  Lowest    Highest  Lowest     Highest  Lowest     Highest  Lowest

First          0        5         1        0          7        4         [?]      [?]
Second         3        2         2       [?]        [?]       6         [?]      [?]
Third          7        7         4       [?]        [?]      [?]        [?]      [?]
Fourth         5        1         8       12         [?]      [?]        [?]      [?]
Totals        15       15        15       15         15       15         [?]      [?]

[?] Figure illegible in the source.


TABLE X.—TEST PREFERENCES FOR THE FIFTEEN HIGHEST RANKING STUDENTS
AND FOR THE FIFTEEN LOWEST RANKING STUDENTS IN INTELLIGENCE
Experiment II

                                     Type of examination
                Essay             Completion        Multiple-choice       True-false
Choice      Highest  Lowest    Highest  Lowest     Highest  Lowest     Highest  Lowest

First          3        3         1        0         10        7          1        5
Second         2        2         3        1          2        5          8        7
Third          2        4         6        5          1        3          6        3
Fourth         8        6         5        9          2        0          0        0
Totals        15       15        15       15         15       15         15       15

Before summarizing the results of this study, a possible explanation 
might be given for the lack of agreement between the results reported 
here and those reported by other investigators who have considered 
the relative merits of the essay examination. In all probability the 
reason for the relatively high rating of the essay examination in this 
study is the objective manner in which it was scored. The procedure 
followed in scoring these tests is definitely comparable to the method 
of scoring the so-called objective examinations. Consequently, 
this evaluation of the essay examination does not serve as a refutation 
for the numerous arguments which appear in the literature opposing it. 
It merely intimates that an essay-type examination may be so given 
and scored as to compare favorably with the objective examinations. 
The labor involved in such a procedure, however, is so great as to 
make it almost prohibitive for large classes. 


SUMMARY


1. Four types of examinations, each covering exactly the same 
subject-matter, were given to a class in statistical methods in education 
and a class in educational psychology. The nature of the four types, 
as well as the order in which they were given, was as follows: Part 
I, essay; Part II, completion; Part III, multiple-choice; Part IV, 
true-false.

2. The intercorrelations of the tests in the course in educational 
psychology suggest that if reliable tests are constructed, one of the 
four types used is probably as adequate as any of the other three for 
measuring the amount of information which the members of the class 
have accumulated. The evidence for this suggestion is not as clear- 
cut in the class studying statistical methods. 

3. If the composite score on three types of examinations is used 
as the criterion for estimating the validity of the fourth type, the 
results indicate that the four types of tests have approximately equal 
validity. 

4. In both experiments the completion and multiple-choice tests 
prove to be most reliable. In the second experiment the reliability 
of the true-false examination is not much lower. 

5. The correlation between intelligence and the essay examination 
is lower than between intelligence and any of the other three types.

6. Considering the choice of all the students in the two classes,
the multiple-choice and true-false tests are preferred to the other two
types studied.

7. There is a tendency for the highest ranking students in intelli- 
gence to prefer the completion examination more than do the lowest 
ranking students. In the first experiment this latter group prefer the 
essay test more than do the members of the highest ranking group. 


AN EXPERIMENT DESIGNED TO TEST THE VALIDITY
OF A RATING TECHNIQUE 


THEODORE NEWCOMB 


Western Reserve University 


I 


The validity of ratings of behavior, provided that two conditions 
are met, has been fairly generally accepted. These conditions are that 
there be several competent judges,¹ and that they have ample opportunity
to observe the behavior being rated. For the past two summers
the writer has been in a situation where both of these conditions were 
adequately met. Data were gathered of such a nature that ratings 
could be compared with more objective measures of the same behav- 
iors; and enough raters were available so that a study of the uniformity 
of the ratings could be made. 

The subjects for each of the studies here reported were thirty 
problem boys who had been sent for study and treatment to a summer 
camp maintained through the cooperation of Western Reserve Uni- 
versity and the Child Guidance Clinic of Cleveland, Ohio. The boys 
remained for a period of five weeks where they were under the constant 
observation of a psychiatrist and of six or more counselors trained in 
psychology and mental hygiene. 

The first summer, in connection with another experiment, the 
attempt was made to validate certain ‘“‘objective’’ measures of prob- 
lem boy behavior by comparing them with ratings on the same behav- 
ior. The results were such as to make the experimenter conclude that 
the ratings were themselves being validated. The evidence, however,
was rather meager, and it was decided to gather fuller data concerning
the same problem.


II 


During the first summer a daily record was kept of “specifically
remembered incidents” included under twenty-six different behaviors.²





1 Note particularly Rugg, H. O.: Is the Rating of Human Character Possible?
Journal of Educational Psychology, Vol. XII, p. 425 and Vol. XIII, p. 93. Rugg
concludes that there should be at least three judges who have had a fair chance to
observe subjects in the behaviors being rated.

2 For a fuller description of this experiment, together with a description of
statistical techniques used, see the writer’s “Consistency of Certain Extrovert-introvert
Behavior Patterns in Fifty-one Problem Boys.” Teachers College
Contributions to Education, No. 382.

Each boy’s record each day was kept on a form similar to the following
sample: 


How was he about getting up in the morning?
[two options illegible in the source]
Dallied, but on time for breakfast
Too late to be on time for breakfast


Records were kept for each boy by his own counselor, who had spent 
all or most of the day with him. Besides these daily records, the 
experimenter recorded some 8500 incidents of the above twenty-six 
and other similar behaviors. Then, at the end of the camp period, 
ratings were obtained from each of seven men on the frequency of 
these same twenty-six behaviors. Inasmuch as the observers had 
played, slept, and eaten with the boys nearly constantly for five weeks, 
a fair degree of accuracy in the ratings might be expected. 

Six of the twenty-six behaviors measured, however, were of such a 
kind that no subject had been observed by any but one man—his own 
tent counselor. Nevertheless, each of the seven observers was asked 
to give ratings on these six behaviors also, in terms of supposed or 
imagined frequency. Six of the seven ratings, in other words, were 
guesses. 

It was not surprising to find a mean correlation between daily
record scores and ratings of .409 ± .102 for the twenty-six behaviors.
The range of correlations was from .02 to .73. What was somewhat
surprising was that the mean correlation for the six guessed behaviors
(i.e., between daily records and ratings) was fully as high as for the
twenty observed behaviors. The mean for the former was .451 ±
.098, and for the latter, .396 ± .104. The difference is not statistically
significant.

The accuracy of the ratings became even more questionable when, 
as measures of uniformity, standard deviations among the seven ratings 
for each behavior were computed. The mean SD for the twenty 
observed behaviors was .88 (the ratings were on a scale of 1 to 5); and 
for the six guessed behaviors, .81. Again the difference is not sig- 
nificant, and it is evident that the raters agreed no more closely about 
frequently observed behaviors than about behaviors which they had 
never seen. 
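
The uniformity measure used here, the standard deviation among the raters' scores for a behavior, can be sketched in Python. The ratings below are invented for illustration, and the use of the population form of the standard deviation (division by N) is an assumption; the paper does not say which form was used. The last lines check that the two group means reported above are consistent with the overall mean of .409.

```python
import statistics


def rater_sd(ratings):
    """Standard deviation among several raters' scores for one behavior.

    Assumption: population form (divide by N).
    """
    return statistics.pstdev(ratings)


def mean_rater_sd(ratings_by_behavior):
    """Average the per-behavior rater SDs over a group of behaviors."""
    return statistics.fmean([rater_sd(r) for r in ratings_by_behavior.values()])


# Hypothetical ratings of one boy by seven raters on a 1-to-5 scale.
observed = {"behavior_a": [3, 4, 2, 3, 5, 3, 4], "behavior_b": [2, 2, 3, 4, 3, 1, 2]}
guessed = {"behavior_c": [2, 3, 3, 4, 2, 3, 3]}

print(round(mean_rater_sd(observed), 2), round(mean_rater_sd(guessed), 2))

# Consistency check on the reported means: 20 observed and 6 guessed behaviors.
overall = (20 * 0.396 + 6 * 0.451) / 26
print(f"{overall:.3f}")  # 0.409
```

A smaller SD means closer agreement among raters, so nearly equal values for observed and guessed behaviors indicate that direct observation added little to the raters' agreement.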

III 


A year later there was a similar group of boys at the same camp. 
Somewhat different means of objective recording of behavior were 
used, as will be described. More extensive ratings were also obtained,
and the problem of studying their relation to frequency of observation 
was directly attacked this time, rather than stumbling on to it as a 
side-issue. 

As before, definite types of behavior were being studied; for 
example, seeking adult attention. Cards were mimeographed, on 
which were listed all the common ways in which boys were actually 
found to seek adult attention. When an incident of this sort was 
observed, the card was checked with the appropriate behavior, at 
that time or as soon after as was possible. Seven or eight observers 
who were constantly on the watch for these specific behaviors could 
be presumed to get a pretty fair sample of all such incidents that 
occurred. There follows a sample card, showing one of the seven
kinds of behavior that were being studied. 


SEEKING ADULT ATTENTION 


[three lines illegible in the source]

Voluntarily working alone with counselor
Hanging around counselor alone
Volunteering in task for counselor
Hanging around counselors’ shack
Antics to annoy or draw attention of counselor
Demonstration of affection for counselor
[one line illegible in the source]
Others (indicate on reverse side)


Altogether some fifty specific behaviors were being thus recorded 
day by day. Naturally, some of these were found to occur with great 
frequency, and others to be comparatively rare. This matter of fre- 
quency with which the behavior is being reported may therefore be 
studied in relation to the agreement between the two measures of the 
same behaviors. 

It was desired to study not only the effect of frequency of reporting
upon the agreement of the two measures, but also that of mere frequency
of occurrence of behaviors not being recorded. These problems,
added to that of the previous summer, i.e., a comparison of
observed with unobserved behaviors, made a total of six kinds of behaviors
to be studied, as follows:


1. Behaviors being recorded in large numbers. 

2. Behaviors being recorded in small numbers. 

3. Behaviors not being recorded, but occurring frequently (as determined by 
the study of the previous summer). 

4. Behaviors not being recorded, but occurring rarely (as determined by the 
study of the previous summer). 

5. Behaviors observed in each boy by only one observer (his own counselor). 

6. Behaviors never observed at all—imaginary situations. 


Ratings were therefore obtained on thirty behaviors, five in each of 
the categories just given. In Table I these behaviors are tabulated, 
together with the results of the ratings. In the list as submitted to the 
raters, the behaviors were of course not thus tabulated, but occurred 
in random order. The number preceding each behavior indicates 
what this random order was. 

Directions for ratings were as follows: 


The following ratings are to indicate the frequency with which each behavior 
occurred. Please rate as follows: 

0. Never occurred at all. 

1. Rare, very exceptional. 

2. Occasional; not frequent. 

3. Fairly frequent. 

4. Very frequent; prominent characteristic. 

Indicate the degree of your rating by placing a plus after the figure if there is a 
fair degree of certainty, a minus if there is considerable uncertainty, and otherwise 
no sign. 

In the case of behaviors which you have not observed, make the best possible 
estimate of how frequently it would occur. 

Rate all the boys for behavior No. 1, then all the boys for behavior No. 2, etc.


TABLE I. BEHAVIORS ON WHICH RATINGS WERE OBTAINED


Group I. Recorded in Large Numbers 


5. Lying or sitting around alone. 

7. Bullying, with physical contact. 
13. Assuming position of prominence at camp fire. 
15. Antics to annoy or draw attention of counselors. 
27. Interrupting or shouting above the group. 


Group II. Recorded in Small Numbers 


1. Being crowded out; avoiding being first. 

3. Taking the initiative in fighting or boxing. 
18. Complaining to counselor of ailment. 
20. Moping or sulking alone. 
24. Profanity or obscenity.


Group III. Not Recorded—Frequently Observed


2. Coming late to meals. 
11. Expressing enthusiasm. 
17. Moving slowly while others hurried. 
21. Singing aloud. 
23. Laughing. 


Group IV. Not Recorded—Rarely Observed 


10. Boasting, or loudly announcing intentions for the future. 
12. Continued chatting; loquaciousness. 

14. Remaining quiet while others were noisy. 

19. Accepting discipline quietly. 

22. Failing to wash or comb. 


Group V. Observed by Only One Counselor 


4. Making his own bed neatly. 
8. Taking more than his share of food at table. 
26. Readily concurring with tent-mates’ choice of activity. 
29. Doing more than his share of after-meal work at camp. 
30. Making noise in the tent before rising hour in the morning. 


Group VI. Never Observed—Imaginary Situations 


6. Stopping play at a previously agreed upon hour and coming in the house. 
9. Playing quietly in order not to disturb sick sister, if asked to do so. 

16. Minding the baby all Saturday afternoon, without leaving. 

25. Giving up attractive plaything to younger brother if he asked for it. 

28. Tidying up his own room at home. 


Thirty boys were thus rated on thirty behaviors by six observers. 
As one measure of the uniformity of the ratings, the SD of the six 
ratings was calculated, for each boy for each behavior. Correlations 
between ratings and objective scores were also calculated for the ten 
behaviors being recorded (Groups I and II). Rating scores for these 
calculations were obtained by adding the six ratings for each behavior 
by each boy. Objective scores were simply the total numbers of 
recorded incidents. 

Reliabilities of the ratings were also calculated for each behavior, 
by correlating the sum of the scores of three raters with the sum of the 
other three raters. This serves as another measure of uniformity 
of the ratings. The correlation between these measures of reliability 
and the SD’s for each behavior is -.652, indicating that the two
methods are measuring at least partly the same thing. 
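
The two measures of uniformity just described can be sketched in a few lines. This is only an illustration under stated assumptions: the ratings are hypothetical, the function names are invented here, the SD is taken as the population SD of the six ratings, and the split into halves (raters 1-3 against raters 4-6) is arbitrary, since the text does not say how the raters were divided.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def mean_sd(ratings):
    """Mean, over boys, of the SD of the six ratings given to each boy."""
    return mean([pstdev(boy) for boy in ratings])


def split_half(ratings, first=(0, 1, 2), second=(3, 4, 5)):
    """Correlate the summed ratings of three raters with those of the other three."""
    a = [sum(boy[i] for i in first) for boy in ratings]
    b = [sum(boy[i] for i in second) for boy in ratings]
    return pearson(a, b)


# Hypothetical ratings (0-4 scale) of four boys on one behavior by six raters.
demo = [
    [0, 1, 0, 1, 0, 1],
    [2, 2, 3, 2, 2, 3],
    [4, 3, 4, 4, 3, 4],
    [1, 1, 2, 1, 2, 1],
]

print(round(mean_sd(demo), 4), round(split_half(demo), 4))
```

A low mean SD and a high correlation between halves both indicate agreement among raters, which is why the two measures correlate negatively across behaviors.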

The results of all these calculations are given in Table II.


TABLE II

Behavior   Mean of        Correlation      Correlation       Number of
number     thirty SD's    between halves   with objective    incidents
                                           scores

Group I: Observed and Recorded Frequently
    5        .7102           .632             .189             123
    7        .7218           .730             .542              98
   13        .8004           .276             .662              69
   15        .6642           .815             .652             133
   27        .8009           .783             .646             100
  Mean       .7395           .647             .538             104.6

Group II: Observed and Recorded Rarely
    1        .7065           .735             .019              21
    3        .6956           .861             .349              19
   18        .6458           .783             .443              25
   20        .7119           .928             .458              11
   24        .6315           .905             .681              34
  Mean       .6782           .842             .390              22.0

Group III: Not Recorded, Frequently Observed
    2        .7405           .507
   11        .7466           .703
   17        .8213           .154
   21        .8040           .500
   23        .7580           .609
  Mean       .7741           .494

Group IV: Not Recorded, Rarely Observed
   10        .7182           .635
   12        .7165           .754
   14        .7977           .778
   19        .7166           .801
   22        .6943           .850
  Mean       .7289           .763

Group V: Not Observed Except by One Man
    4        .7151           .750
    8        .6697           .548
   26        .7553           .856
   29        .7152           .901
   30        .8306           .479
  Mean       .7372           .706

Group VI: Never Observed, Imaginary Incidents
    6        .8107           .486
    9        .7096           .474
   16        .6629           .739
   25        .7290           .791
   28        .6899           .729
  Mean       .7204           .643


Considering first the correlations between the two measures, for
which calculations were possible only in Groups I and II, the mean
coefficients are respectively .538 ± .087 and .390 ± .105. The SD
of the difference of these two coefficients, using the formula

$$SD_{\text{diff.}} = \sqrt{SD^{2}_{\text{mean I}} + SD^{2}_{\text{mean II}}},$$

is .1254, the difference being equal to 1.10 sigmas, which can not be
said to be statistically significant. As far as agreeing with the more 
objective measures is concerned, then, there is no significant difference 
between behaviors frequently recorded and those rarely recorded. 
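
The significance test used here and in the comparisons that follow has two steps: find the SD of a difference between two means, then express the difference in sigmas. A minimal sketch follows, assuming the formula is the familiar one for independent means (root of the sum of the squared SD's). The function names and the inputs are illustrative only and are not taken from the study.

```python
from math import sqrt


def sd_of_difference(sd_mean_a, sd_mean_b):
    """SD of the difference between two independent means."""
    return sqrt(sd_mean_a ** 2 + sd_mean_b ** 2)


def difference_in_sigmas(mean_a, mean_b, sd_diff):
    """Size of the difference expressed in units of its own SD."""
    return abs(mean_a - mean_b) / sd_diff


# Illustrative values only: two means whose SD's are .03 and .04.
sd_diff = sd_of_difference(0.03, 0.04)
print(sd_diff, difference_in_sigmas(0.70, 0.80, sd_diff))
```

By the convention of the period, a difference of about three sigmas or more was usually treated as practically certain to be real; smaller ratios were not regarded as established.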

The relative uniformity of the ratings for each group of behaviors
is fairly obvious from Table II. The mean SD for the groups of
behaviors ranges between .6782 for Group II and .7741 for Group III. The
SD of the difference between these two values (using the formula 
already cited) is .0282, the difference being equal to 4.6 sigmas. While 
this is a great enough difference to indicate probable significance, 
there is no significant difference between any other two groups of 
behaviors. The difference in this one case would seem to indicate 
that behaviors being recorded in small numbers are more uniformly 
rated than those being frequently observed and not recorded; but not 
significantly more consistently rated than those which were pure 
guesses. Groups I and III, which represent behaviors frequently 
observed, have a mean SD of .7395 and .7741, respectively, whereas 
Groups V and VI, representing behaviors never observed, have mean 
SD’s of .7372 and .7204, respectively. Neither the fact of being 
observed frequently, nor having been watched for and reported for a 
five-week period lent any advantage in uniformity of rating over the 
ratings which were mere guesses. 

There was a barely significant difference between behaviors No. 24 
and No. 17, those with the lowest and highest mean SD’s, respectively. 
The sigma of the difference is .0427, the difference being equal to 
4.42 sigmas. 

Mean SD’s for each boy were also calculated; i.e., the mean of the 
thirty behaviors on which each boy was rated. There was somewhat 
more uniformity here than among the thirty behaviors. The lowest 
and the highest of these mean SD’s were, respectively, .6615 and
.8163. The SD of the difference is .051, the difference being equal
to 3.03 sigmas.

That this scarcely represents a real difference is further indicated 
by a calculation of the correlation between the number of recorded
incidents for each boy (ranging from 26 to 295) and the mean SD
for each boy. This coefficient, representing the relation between the
number of recorded incidents and the variation in ratings, was .161 ±
.120. This is probably not surprising, since the mere fact of not
having recorded many incidents for certain boys would give the 
rater a fair idea of the frequency of the behavior being rated. Hence 
boys with few recorded incidents would tend to be as accurately and 
uniformly rated as those with many. 

Very similar results are obtained by comparing the average relia- 
bility of the several groups of behaviors. The only significant differ- 
ence, again, is between that of Group II and Group III, the difference 
being equal to 4.1 sigmas. We find the average reliability of the five 
guessed behaviors to be almost precisely the same as that of the five 
behaviors most frequently seen and recorded. 

The mean correlation between one half of the ratings and the other 
half is .683. This gives a reliability, by the use of the Spearman- 
Brown formula, of .81. In order to secure a reliability of .90 it would 
be necessary to have 2.1 times as many raters, or about twelve. It 
should be added that the boys presented a wide range of behaviors, so 
that a fairly high degree of reliability should be expected. 
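
The arithmetic of this paragraph can be checked with the Spearman-Brown formula. In the sketch below the function names are chosen here for illustration; the half-against-half correlation of .683 is taken from the text.

```python
def spearman_brown(r, k=2.0):
    """Reliability of a measure lengthened k times, given reliability r."""
    return k * r / (1 + (k - 1) * r)


def lengthening_factor(current, target):
    """How many times longer a measure must be to reach the target reliability."""
    return target * (1 - current) / (current * (1 - target))


full = spearman_brown(0.683)             # all six raters together
factor = lengthening_factor(full, 0.90)  # multiple of six raters needed
print(round(full, 2), round(factor, 1), round(6 * factor, 1))
```

The stepped-up reliability rounds to .81 and the lengthening factor to 2.1, in agreement with the figures given; six raters multiplied by that factor comes to between twelve and thirteen.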

Another aspect of ratings which interested the writer was the 
comparative uniformity of ratings of specific behaviors and of gen- 
eralized traits. Ratings were obtained on but two such traits, as 
follows: 


1. Degree to which each boy was an acceptable member to those of his own
tent group.

2. Degree to which each boy was an acceptable member to the majority of the
thirty boys in camp.


Ratings were made on a scale similar to that used in the specific 
behavior ratings, and the mean SD for each trait was similarly cal- 
culated. The mean SD’s were, respectively, .6705 and .6520, for the 
above traits. The self-correlations were .885 and .655 respectively. 
While these results are not significantly different from those shown in
Table II, they show comparatively high uniformity, and further
measuring of other traits might have shown a tendency to rate traits more
uniformly. It is to be regretted that more data were not obtained
at this point.

In the above calculations the degree of certainty which was indicated
along with the numerical values for the ratings has not been
considered. Would the results have been different if the numerical
values had been weighted according to the degree of certainty?

Table III reveals surprising differences in certainty among the 
six groups of behaviors. There is a steadily decreasing certainty
from Group I to Group VI. The calculations shown in Table III 
were made by finding the algebraic sum of the number of boys rated 
with certainty (marked plus) and the number of boys rated with 
uncertainty (marked minus) for each behavior. 
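
One reading of this "algebraic sum" is a simple count: each rating marked plus adds one, each rating marked minus subtracts one, and unmarked ratings add nothing. The sketch below follows that reading, which is an assumption; the function name and the marks are hypothetical.

```python
def certainty_index(marks):
    """Count of '+' marks minus count of '-' marks; unmarked ratings count zero."""
    plus = sum(1 for m in marks if m == "+")
    minus = sum(1 for m in marks if m == "-")
    return plus - minus


# Hypothetical marks attached to eight ratings of one behavior.
marks = ["+", "+", "", "-", "+", "", "-", "+"]
print(certainty_index(marks))
```

On this reading a positive index means that confident ratings outnumbered doubtful ones for that behavior, and a negative index the reverse.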

TABLE III. DEGREE OF CERTAINTY OF RATINGS FOR EACH BEHAVIOR

       Group I                  Group II                 Group III
Behavior                 Behavior                 Behavior
number     Certainty     number     Certainty     number     Certainty

    5         41             1         18             2         -2
    7         11             3         27            11         12
   13         46            18         10            17         29
   15         54            20         24            21          4
   27         37            24         46            23         38
  Mean        37.8         Mean        25.0         Mean        17.0

       Group IV                 Group V                  Group VI

   10          9             4         -5             6        -23
   12         23             8        -11             9        -42
   14         32            26         15            16        -10
   19         20            29         37            25        -22
   22         19            30         -3            28        -10
  Mean        20.6         Mean         6.6         Mean       -21.4


Now, with such decided differences in the certainty of the ratings
for the various groups of behaviors, one might expect to find
significant differences in uniformity of rating if certainty is taken into
account. Calculations were therefore made according to the following
system of weighting: Ratings marked certain were weighted three 
times; those marked neither certain nor uncertain were weighted 
twice; and those uncertain, once. 
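
The weighting just described amounts to counting each rating one, two, or three times before the SD is taken. A minimal sketch follows, assuming a frequency-weighted population SD; the exact computation used in the study is not stated, and the function name and sample ratings are hypothetical.

```python
from math import sqrt

# Weights from the text: certain (+) three, unmarked two, uncertain (-) one.
WEIGHTS = {"+": 3, "": 2, "-": 1}


def weighted_sd(ratings):
    """SD of (score, mark) pairs, each score counted according to its mark."""
    total = sum(WEIGHTS[mark] for _, mark in ratings)
    mean = sum(WEIGHTS[mark] * score for score, mark in ratings) / total
    variance = sum(WEIGHTS[mark] * (score - mean) ** 2 for score, mark in ratings) / total
    return sqrt(variance)


# Hypothetical ratings of one boy on one behavior by three raters.
print(round(weighted_sd([(2, "+"), (3, ""), (4, "-")]), 4))
```

When every rating carries the same mark the weights cancel and the result is the ordinary SD, so the weighting can alter the result only where raters differ in certainty.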

Behavior No. 9 was rated with less certainty than any other
behavior. Its mean SD as given in Table II is .7096. As recalculated
according to the above weighting, the mean SD is .6966, a negligible
difference.

Behavior No. 15 was rated with the most certainty. The unweighted
mean SD is .6642; the weighted mean SD is .6552, again a negligible
difference.

A third calculation was made, chosen at random. The mean
unweighted SD for behavior No. 1 is .7065; weighted, .7067, practically
identical. It was therefore not thought worth while to recalculate
the other behaviors, since the differences are obviously insignificant.

The failure of this factor of certainty to render more uniform the 
calculations may be explained by a further correlation. The relation 
between certainty and uniformity (as measured by the mean SD) is 
-.0015. This means, of course, that raters differed as much among
themselves when fairly certain as when uncertain. The only con- 
clusion to be drawn from the estimates of certainty, then, is that 
they tended to confirm the basis on which the six groups of behaviors 
were chosen. The experimenter was evidently fairly successful in 
choosing those behaviors which should be rated with the most and the 
least certainty. The only difficulty was that certainty and uniformity 
had little or nothing to do with each other. 


IV 


Some light may be thrown on possible reasons for these results by 
again referring to the study of the first summer. The twenty-six 
behaviors then measured were chosen as being classifiable under some 
one of nine traits supposedly indicative of introversion-extroversion. 
Intercorrelations of all behaviors grouped under a given trait were 
calculated for both sets of data, i.e., ratings and objective records.
The mean intercorrelation of these one hundred twelve intra-trait
behaviors, according to the ratings, was .493 ± .093; the mean of
the same one hundred twelve intercorrelations, according to the
objective records, was .141 ± .121. The conclusion may therefore
be drawn that the halo effect, inevitable in the ratings, worked in 
such a way as to cause the rater to rate similarly logically related 
behaviors such as those classifiable under a single trait. The close 
relation between the intra-trait behaviors which is evident in the 
ratings may, therefore, be presumed to spring from logical presupposi- 
tions in the minds of the raters, rather than from actual behaviors. 

This hypothesis helps to account for the failure of the frequently 
observed behaviors to be rated more uniformly than those rarely 
observed, or those never observed at all. There is, of course, one 
other possibility: that the frequently observed behaviors are, indeed,
accurately rated and that, by inferring from familiar behaviors, the
imagined situations are rated with equal accuracy. To take this 
latter position, however, is equivalent to saying that behaviors may 
be accurately rated by guessing at them when they have never been 
observed. Few psychologists will be willing to concede this. It 
seems a truer statement to say that even with ample opportunity for 
observing, the ratings were little or no better than guesses. 

The results of the study of degree of certainty tend to confirm 
this view. Raters agreed no more closely when fairly certain than 
when uncertain. The writer ventures a suggestion as to why this was 
true. Assuming, as above, that the halo effect was at work almost 
equally in rating observed and unobserved behaviors (since the latter 
were as uniformly rated as the former), in the case of the former the 
raters were able to recall incidents supporting their "halo-ized"
impression, conveniently forgetting, of course, the incidents which
did not support this impression. Such ratings were therefore marked
certain. In the case of the unobserved behaviors no such incidents 
could be called up; the impression remained, but could not be factually 
supported, and was therefore marked uncertain. That the raters 
agreed to a considerable degree indicates that their halos coincided to 
some extent. That they differed indicates that each rater had his 
own private halo. 

The writer concludes, therefore, that under these conditions, 
apparently optimum for rating purposes, ratings on specific behaviors 
were so largely colored by the surrounding halo as to be quite invalid.
Neither frequent observation of the behavior being rated, frequent 
recording of it, nor weighting the scores in accordance with the degree 
of the rater’s certainty had an appreciable effect on uniformity of 
rating. Behaviors never recorded, never seen, and those felt to be 
highly uncertain were as uniformly rated as those at the opposite 
extremes. Since the guessed ratings are of highly questionable
validity, must not the others, which are no more uniform, be almost equally
invalid?


A METHOD FOR JUDGING THE DISCRIMINATION OF
INDIVIDUAL QUESTIONS ON TRUE-FALSE
EXAMINATIONS¹


C. H. WHELDEN, JR. AND F. J. J. DAVIES 
Yale University 


I. INTRODUCTION 


The following introductory comment on the scoring of true-false 
examinations is merely for the purpose of making clearer some of the 
later portions of this article. It is generally recognized that whatever 
other method of scoring a true-false examination may be used, the 
method of scoring by the total of right answers is not satisfactory. 
If the number of questions is at all large, a man would be apt to get 
half his answers correct by sheer guessing. 

A number of methods of scoring designed to eliminate the effects 
of guessing are possible, and the method adopted for any given case 
will depend upon the particular conditions of that case. One such 
method is the following, which was adopted at the Yale Law School 
for its true-false examinations. Preliminary warning is given that 
guessing will be penalized, that if an answer has to be guessed it had 
better be omitted entirely. The examination is scored by adding the 
sum of omitted answers to twice the sum of wrong answers. The 
lowest score is then best and the highest is worst. The theory under- 
lying the method is that guessing will be greatly minimized if not 
entirely eliminated, and that simple lack of information on a given 
question is not to be scored against so heavily as definitely wrong 
information on that question. 
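
The scoring rule just described can be stated compactly. The sketch below is illustrative only: the function name and the sample paper are hypothetical, and an omitted answer is represented by None.

```python
def penalty_score(answers, key):
    """Omitted answers count one, wrong answers two; the lowest score is best."""
    wrong = sum(1 for a, k in zip(answers, key) if a is not None and a != k)
    omitted = sum(1 for a in answers if a is None)
    return omitted + 2 * wrong


key = [True, False, True, True]
paper = [True, True, None, True]  # one wrong answer, one omission
print(penalty_score(paper, key))
```

Under this rule a man who is unsure of an answer risks two points by guessing against a certain one point by omitting, so a blind guess on an even chance gains nothing on the average.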

The minimizing of guessing through the preliminary warning seems
to be accomplished. A brief investigation has given about as certain
an indication of that result as is possible. Out of seven examinations
given in June, 1928 it was found on all but one that the group of men
with the highest average law grades for the year (the top third of the
group taking any one examination) had a smaller proportion of their
examination scores accounted for by omitted answers than did the
group of men with the lowest average law grades for the year (the
bottom third of the group taking any one examination). On the sixth
examination, as they are listed in Table I, the proportion ran slightly
the other way but not sufficiently so to destroy the significance of the
evidence given by the other six. The indication might be taken to
be that the best men guessed more than the poorest men. Logically
the poorest men would be expected to do the most guessing, as anyone
with much to gain and little to lose by taking a chance is most apt to
take the chance. The indication, therefore, is rather that guessing
was so well minimized that the poorest men did not reveal their
relatively greater propensity in that direction and that the best men,
not realizing their own limitations so clearly as the poorest, gave
evidence of incorrect information where the poorest men admitted
lack of all information. (See Table I.)

¹ The substance of a report prepared for the Yale Law School and the
Department of Personnel Study of Yale University.

TABLE I. THE PERCENTAGES OF TOTAL GROUP SCORES DUE TO WRONG ANSWERS
AND TO OMITTED ANSWERS ON SEVEN TRUE-FALSE EXAMINATIONS, YALE
LAW SCHOOL, JUNE, 1928

Group X: top third of the men taking each examination, according to
their law grades for year.
Group Z: bottom third of the men taking each examination, according to
their law grades for year.

                          Per cent of total score due to
                          Group X                 Group Z
Course                Wrong     Omitted       Wrong     Omitted
                      answers   answers       answers   answers

I                      88.3      11.7          84.7      15.3
II                     94.2       5.8          90.9       9.1
III                    93.8       6.7          92.9       8.1
IV                     92.8       7.2          92.4       7.6
V                      94.6       5.4          91.7       8.3
VI                     96.0       4.0          96.6       3.4
VII                    94.0       6.0          90.8       9.2
Seven examinations     93.4       6.6          91.3       8.7

NOTE. The table is designed for the comparison of Group X with Group Z.
If the table is used for a comparison of wrong answer percentage with omitted
answer percentage within either group, it should be remembered that wrong
answers are multiplied by two before going into the total score.


One of the chief objects of a true-false examination and of its
scoring, as in the case of any examination, is, of course, to discriminate
adequately between men of different ability. A true-false examination
correctly scored may give such discrimination fairly well, but at the
same time certain questions on the examination may give it much more
adequately than others. A question, for example, may contain some 
subtle ambiguity which will distract the better men but will not 
trouble the poorer men at all; another may be so difficult that only 
the exceptional man can answer it correctly while all others fail to 
comprehend it regardless of the varying degrees of their actual ability. 
Another question may be so simple or fundamental as to be answerable 
by practically anyone who has been exposed to the subject-matter. 

In order that true-false examinations may be improved in con- 
struction so as to discriminate more adequately between men, it is 
important that a method shall be available for judging the relative 
power of discrimination of the different questions which have been 
included in such examinations in the past. As a preliminary to the 
development of any such method it should be recognized that wherever 
possible the basis for the judgment on the individual questions should 
be the same as the basis selected as correct for the scoring of the exami- 
nation. The discrimination given by the examination as a whole 
is only the net result of the discrimination given by the individual 
questions. 


II. THEORY


If the relative abilities of men in the general field under examination 
can be determined independently of the specific examination which 
is to be judged, it can be said that the examination, if it discriminates 
correctly, should give those men scores which will vary in accordance 
with the independently determined measures of their respective 
abilities; such scores may be called Standard Examination Scores. 
For the sake of initial simplicity in exposition, in spite of the criticism 
already made of scoring by the total of right answers, consider the 
case of an examination of one hundred questions scored on the basis of 
total right answers, but with guessing supposed to be non-existent. 

Suppose that the examination is so perfectly discriminating as to 
give a score of 80 to any man whose independently measured ability 
in the field is 80 on a scale of 100. Suppose there are one hundred 
such men. Each one will have given the right answer to eighty ques- 
tions. If all the questions are of exactly equal caliber, not all the 
one hundred men will have given the right answer on any one question 
or the wrong answer on any one question. By definition the questions 
are of equal caliber. In the hands of this group of men of equal ability 
each question will consequently be answered right by eighty men and
wrong by twenty men, that is, the standard man-score per question
will be 80 on the basis of right-answer scoring. 

Similarly, if there is another group of one hundred men, each of 
whom has an independently measured ability of 60 on a scale of 100, 
each one of the one hundred questions of equal caliber will be answered 
right by sixty men in this group. If the men in the first group obtained 
scores of 60 instead of 80 and the men in the second group scores of 45 
instead of 60 (that is, if in the first group each question was answered 
right by sixty men and in the second group each question was answered 
right by forty-five men) the examination would have been one of 
greater difficulty, but each question, as well as the examination as a 
whole, would still be discriminating correctly between the men of 
the two grades of ability. The men of the second group are still 
shown as possessing three-fourths as much ability as the men of the 
first group. 

If one question, however, is answered right by eighty men in the 
first group and by eighty men in the second group, it does not dis- 
criminate between the two grades of ability. If all the questions were 
of the same caliber and gave this result, the score of each of the two 
hundred men on the examination would be 80 and the examination 
would not be discriminating as between the two groups of men of 
different ability. Similarly, if a question is answered right by eighty 
men in the first group and only forty men in the second, the question 
is over-discriminating. If a question is answered right by forty men 
in the first group and eighty in the second, the question gives reverse 
discrimination. 

Go back to the case of perfect discrimination by each question. 
If there had been fifty instead of one hundred men in each of the 
two groups, the number getting each question right would obviously 
have been for each group one-half of the number found in the case 
of one hundred men to a group. If, on the other hand, there had been
two hundred instead of one hundred questions on the examination, 
if all other conditions and assumptions remain the same, the exami- 
nation score of an 80 man would have been 160, and the score of a 60 
man, 120. If in this latter case there are also two hundred men in 
each group, each question will be answered right by one hundred sixty 
men of the first group and by one hundred twenty men of the second
group. 

Whatever the standard examination score is for a group of men of 
given ability as independently measured, the number of men in the
group which will get right each one of the set of perfectly discriminating
questions of average difficulty is given by the product of the standard 
score and the ratio of the number of men in the group to the number 
of questions on the examination. If the number of men of given 
ability in one group is fifty and the number of questions is two hundred, 
the standard score for the men of that group being, say, 160, the 
number of men getting each such perfect question right will be (160 ×
50/200), that is, 40. This relationship must be used to reduce the
standard question-score per man to the standard man-score per ques- 
tion, that is, to reduce the standard examination score to what is 
hereafter called the standard corrected examination score. 
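
The reduction just stated is a single multiplication, and the cases worked in the preceding paragraphs can be checked against it. The function name below is chosen for illustration; the figures are those of the text.

```python
def standard_corrected_score(standard_score, n_men, n_questions):
    """Standard man-score per question: standard score times men over questions."""
    return standard_score * n_men / n_questions


print(standard_corrected_score(160, 50, 200))   # fifty men, two hundred questions
print(standard_corrected_score(80, 100, 100))   # the opening illustration
print(standard_corrected_score(160, 200, 200))  # two hundred men and questions
```

The three cases give 40, 80, and 160 men answering each question right, as stated in the text.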

Not all the questions will be simply of average difficulty. One will 
be slightly more difficult, another slightly less, but they may still 
discriminate correctly between the grades of independently measured 
ability. In order to judge quickly the degree of discrimination shown 
by any question, it will be convenient to express the standard man- 
scores per question as ratios of one another. If, for example, the 
standard corrected score for a group of men of 80 ability is 30 and for 
a group of men of 60 ability is 15, the ratio of the score for the first 
group to the score for the second is 2.00. Then if on a given question 
twenty men of the first group and ten men of the second answer 
correctly, although the respective scores are not the standard corrected 
scores, it is at once clear that the question still discriminates correctly 
between the two groups, for the ratio of the scores made on the question 
is 2.00. 

When the whole number of men taking an examination is divided 
into groups according to their independently measured abilities in the 
field (three such groups will generally be advisable for the purposes 
of the later judging of the questions), it will rarely, if ever, happen that 
all the men in any one such group will have the same independent 
measure of ability. The grades taken as measuring this general 
ability will vary within each group. The independent measure of 
ability for each group of men as a whole will be the average of their 
individual measures. Correlative with this average for the group
there will be an average standard examination score in place of the 
single definite standard examination score implied by the discussion 
which has preceded. 

The fact that the standard score for each group is in the nature
of an average taken from variable measures requires another
modification in the criteria used for judging the discrimination of the
individual questions. Allowance must be made for the existence of
variability within each group, for it introduces the probability of 
chance-fluctuations in any measures of the operations of the group. 
The practically certain limits of variability within any group which
does not depart too far from normal in its distribution are given by the
range of three times the sigma (Standard Deviation) of the group
above and below the arithmetic average of the group. The sigma
of the independent measures of ability for each group must be found, 
and then translated, in the form of a percentage of the mean from 
which it is calculated, to terms of standard examination score, just as 
the average independent measure of ability is to be found and trans- 
lated; the process of this translation will be described in detail. The 
translated average and the translated 3-sigma range give a standard 
range of corrected examination scores for each group. 

For example, a group of men with average independently measured 
ability of 80 may have a standard range of corrected examination 
scores of 60 to 40; and a group whose independent average is 60, a 
range of 30 to 20. Respective scores on a given question which fall 
within these ranges will now be the index of a correctly discriminating 
question. The actual scores might be 60 and 20 respectively or 40 
and 30 respectively. Either pair of scores must be taken as indicating 
correct discrimination since the two scores of each pair are within their 
respective 3-sigma limits of variation. 

To keep the advantage of having ratios between scores, instead of a 
comparison between the scores themselves, for judging the degree of 
discrimination shown by a question, the limiting scores for the groups 
may be expressed as ratios of one another (the upper limit of one 
group as a ratio of the lower limit of another, and the lower limit of 
the first as a ratio of the upper limit of the second). The result is a 
range of ratios, instead of a single ratio, to be used as the criterion for 
discrimination. In the example last given, the range of ratios for 
criterion is 60/20 to 40/30, or 3.00 to 1.33.
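
The criterion of the last two paragraphs can be put as a short check. The sketch assumes right-answer scoring, as in the illustration, and takes the standard ranges of corrected scores as given; the function names are hypothetical.

```python
def ratio_range(range_a, range_b):
    """Widest and narrowest ratios of group A's score to group B's score."""
    widest = max(range_a) / min(range_b)
    narrowest = min(range_a) / max(range_b)
    return widest, narrowest


def discriminates_correctly(score_a, score_b, range_a, range_b):
    """True when the ratio of the two actual scores lies within the range of ratios."""
    widest, narrowest = ratio_range(range_a, range_b)
    return narrowest <= score_a / score_b <= widest


# Standard ranges from the example: 60 to 40 for the first group, 30 to 20 for the second.
print(ratio_range((60, 40), (30, 20)))
print(discriminates_correctly(60, 20, (60, 40), (30, 20)))
print(discriminates_correctly(80, 80, (60, 40), (30, 20)))
```

A question answered right by equal numbers in both groups gives a ratio of 1.00, which falls below the range and marks the question as failing to discriminate.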

The same theory and method apply whatever the system used 
for scoring the true-false examination. In the case of the Yale Law 
School examinations to which the method has been applied the system 
of scoring, as explained before, was twice the sum of wrong answers 
plus the sum of omitted answers. The transition from question-score 
per man to man-score per question follows here just as it does in the 
case, used purely for illustrative purposes, where scoring is by the 
total of right answers. The question is always scored by the system
which is used to score a man on the whole examination. In this case,
therefore, each answer is scored for each of the groups of men by twice 
the sum of the men getting it wrong plus the sum of the men omitting 
it. The theory no longer says, as in the case of the preceding illustra- 
tions, that of a group of men, all of 80 ability as independently meas- 
ured, 160, say, if that is the standard corrected score for the group, 
will give the right answer on each of a set of questions of average 
difficulty and correct discrimination. It says instead that the men of 
such a group will on such a set of questions of equal caliber give the 
right answer to one question just as frequently as to any other, will 
give the wrong answer to one question just as frequently as to any 
other, and will omit one question just as frequently as any other. 
The frequency of answering right, answering wrong, and omitting to 
answer depends upon the level of ability of the men in the group. 
The theory says, in other words, that the man-score per question in 
such a case will be the standard corrected examination score, all 
scoring being by a common system. 
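
As a small illustration of scoring a question by the same system used to score a man, the following sketch applies the rule of twice the wrong answers plus the omissions to one group. The function name `question_score` is hypothetical; the counts are the Group X entries shown on Form 3, and the results agree with the X column of Form 4.

```python
def question_score(wrong, omitted):
    """Score of one question for one group: twice the wrong answers plus the omissions."""
    return 2 * wrong + omitted


# (wrong, omitted) counts for Group X, questions 1 to 6, as listed on Form 3.
group_x_counts = {1: (0, 0), 2: (12, 2), 3: (14, 3), 4: (5, 0), 5: (11, 7), 6: (3, 8)}
scores = {q: question_score(w, o) for q, (w, o) in group_x_counts.items()}
print(scores)
```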

The one important point so far omitted from consideration is the 
matter of transition from the independently determined measures of 
ability in the field to the scores on a particular examination. In 
speaking of independently determined measures the meaning is simply 
that such measures must not be determined solely or primarily from 
the results of the examination of which the questions are to be judged 
for discrimination. 

In the case of the Yale Law School true-false examinations the 
independent measure of ability of a man was simply his average law 
grade for the year. Such a grade was felt to be sufficiently inde- 
pendent of a man’s showing on any one true-false examination to 
which the method of judging questions might be applied. The man’s 
law grade for the year is the average of all his grades in all his courses. 
Generally a man would take about five courses and in approximately 
three of these, as the general case, would be given a true-false exami- 
nation. Almost without exception such a true-false examination 
would not stand alone in the course but would be supplementary 
to a written examination of the usual type. A man’s showing on 
any one true-false examination in any one course is, therefore, but 
a small part of his average law grade for the year in all courses. Such 
average year grades are not, of course, precise measures of ability in 
the field of law. Such grades, however, and it is the only important 
point in this connection, will be as good a reflection of the relative 
differences in ability of a group of men as is possible with anything so 
arbitrary in nature as an academic grade. 

The connection between the independently determined measures 
of ability and the scoring of the given examination of which the ques- 
tions are to be judged for discrimination may be made by a simple 
transmutation. A generally satisfactory method of making the 
transmutation for those scores which are necessary in the establish- 
ment of criteria for the judging of questions, without any waste of 
time over those scores which are not necessary for the purpose, may be 
illustrated by the procedure followed in the case of the Yale Law 
School examinations. 

The average or year-grades of all law students for the year con- 
cerned are tabulated in a frequency series and cumulated downward 
from the higher grades. On the examination to be judged the scores 
are similarly tabulated and cumulated downward from the lower 
scores, since on the system of scoring used a low score means a better 
showing than does a high score. Suppose there are three hundred 
men represented in the general distribution of year-grades and ninety 
in the distribution of the scores on the particular examination. The 
ninety men taking the examination have been divided into three 
groups of equal size according to their year-grades. Suppose that 
one of these groups has an average of 80 in year-grades. In the 
general distribution of year-grades there is found the percentage of 
the three hundred men with grades of 80 or higher. In the distribution 
of examination scores this same percentage of the ninety men there 
included will have attained or exceeded a certain score (exceeded here 
in the sense of getting a lower score). This score is the transmuted 
equivalent of the eighty in year-grades. When this score is multi- 
plied by the ratio of the number of men in the particular group (30 
in this case) to the number of questions on the examination, the product 
is the average standard corrected examination score for the group. 
The calculation of the 3-sigma range of variation follows directly at 
this point, in accordance with the method already indicated and 
explained more precisely in the section on Procedure. 
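
The transmutation and the 3-sigma range can be sketched as follows. This is a minimal sketch, not the authors' own computation: the function names, the rounding of the percentage to a whole number of men, and all of the data (including the figure of 100 questions and the sigma of 2.0) are assumptions made for illustration.

```python
def transmute(mean_grade, year_grades, exam_scores):
    """Exam score reached or bettered by the same share of men as reach mean_grade."""
    share = sum(1 for g in year_grades if g >= mean_grade) / len(year_grades)
    rank = max(1, round(share * len(exam_scores)))
    # A low score is a better showing, so count upward from the lowest score.
    return sorted(exam_scores)[rank - 1]


def standard_range(exam_score, group_size, n_questions, mean_grade, sigma_grade):
    """3-sigma range of the standard corrected examination score for a group."""
    corrected = exam_score * group_size / n_questions
    sigma_value = corrected * sigma_grade / mean_grade
    return corrected - 3 * sigma_value, corrected + 3 * sigma_value


# Hypothetical data: 300 year-grades, 20 of them at 85 or higher; 93 examination scores.
year_grades = [90] + [88] * 4 + [87] * 2 + [86] * 10 + [85] * 3 + [70] * 280
exam_scores = [20, 21, 22, 22, 23, 23] + [40] * 87
score = transmute(85, year_grades, exam_scores)
low, high = standard_range(score, group_size=31, n_questions=100,
                           mean_grade=85.0, sigma_grade=2.0)
print(score, round(low, 2), round(high, 2))
```

With these figures 6.7 per cent of 93 men is taken as 6 men, and the sixth best score, 23, becomes the transmuted equivalent of a year-grade of 85.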

This method of transmutation assumes only that grades or scores 
on any one examination of any kind in the Law School will tend to 
follow in their actual significance a distribution which will be fairly 
uniform with the distribution of final year-grades given to the students 
in the school. The assumption is one of similarity and not of coinci- 
dence. So long as the particular examination is intended for no 
unusual purpose but simply for the customary testing of relative 
abilities, and so long as the number of cases in the distributions of 
grades and scores is fairly large, the assumption seems justified. 

The rules for practical judging of the questions, as well as the 
exact method of establishing the criteria for judgment, are given in 
the section on Procedure. The language of this section, which follows 
shortly, applies specifically to the Yale Law School examinations but is 
easily generalized. The actual rules find their premise in the under- 
lying theory of the system, as that theory has been summarized here, 
but they have been modified and developed out of practical experience in 
judging questions on seven true-false examinations given in the Yale 
Law School in June, 1928. No new modifications of these rules were 
found necessary when the system was applied to other examinations 
given in the school in February, 1928 and in February and June, 1929, 
but they should be considered as still open to modification as more 
experience with the system is gained. As given in the section on 
Procedure the rules are for the most part self-explanatory. 

A test has been made of the validity of the results obtained in the 
application of this system of judgment to true-false examinations 
given at the Yale Law School between 1928 and June, 1929. The test 
consisted in a series of linear correlations between average law grades 
and the true-false examination scores. 

It should be remembered that the object of the system of judgment 
is to classify the questions according to the nature of the discrimination 
they tend to show as between students of different abilities. The 
standard for judging such discrimination by the questions is based 
on the discrimination actually made between students by their law 
grades. Thus, if the judgments given by the system are correct, the 
scores made on an examination composed entirely of discriminating 
and over-discriminating questions would have a high degree of positive 
correlation with the law grades of the men taking the examination; 
those made on an examination composed entirely of non-discriminating 
questions would have a very low correlation; those made on an exami- 
nation composed entirely of reversely discriminating questions would 
have a negative correlation. 

The test, in its particular form, was suggested by Mr. Paul W. 
Burnham of the Department of Personnel Study at Yale. The results 
of the test are given in Table II. It will be observed that throughout 
the eleven examinations listed in the test the correlations of law grades 
with the actual examination scores are lower than the correlations of 
law grades with the scores which would have been obtained if each 
examination had consisted only of its discriminating and over-dis- 
criminating questions, and that these latter correlations in themselves 
are positive and relatively high. It will be observed, further, that the 
correlations of law grades with the scores which would have been 
obtained if each examination had consisted only of its non-discrimi- 
nating questions are in still another category of distinctly low degree, 
and that the correlations of law grades with scores based entirely on 
reversely discriminating questions are uniformly negative. The degree 
of success in the application of the system obviously varies from 
examination to examination, but shows a marked variation downward 
in only one of the eleven examinations tested (number VI as they are 
given in Table II). The correlations obtained are such as apparently 
to establish the validity of the judgments. 


TABLE II.—CORRELATIONS BETWEEN AVERAGE LAW GRADES AND TRUE-FALSE 
EXAMINATION SCORES, FOR ELEVEN EXAMINATIONS GIVEN AT YALE LAW 
SCHOOL, 1928-1929 

Coefficients of correlation. Columns (2) to (4) give law grades with 
examination scores that would have been obtained if examination had 
consisted only of the class of questions named. 

                 (1)              (2)               (3)              (4)
              Law grades     Discriminating         Non-          Reversely
Examination   with actual    and over-         discriminating   discriminating
              examination    discriminating      questions        questions
              scores         questions

I                .71             .81                .29             -.26
II               .82             .87                .40             -.27
III              .69             .78                .41             -.35
IV               .66             .69                .29             -.24
V                .75             .84                .29             -.36
VI               .46             .58                .40             -.08
VII              .57         [illegible]            .10             -.31
VIII             .50             .75                .12             -.63
IX               .51             .81               -.02             -.54
X                .58             .67                .28             -.24
XI               .56             .71                .36             -.48

Note.—Coefficients are positive except where indicated as negative. 




The Journal of Educational Psychology 


III. PROCEDURE 


The following are instructions for judging questions on true-false 
examinations of Yale Law School. Each numbered paragraph is 
hereinafter referred to as a section. 

1. On each examination paper for a given examination place the student’s law 
grade (average grade for year in all courses) as the independent measure of the 
student’s ability. 

2. For the given examination group make a distribution plot of the law grades, by 
tenths of per cent (Form 1). 

3. From Form 1 divide the examination group into three sub-groups according to law 
grade, each sub-group containing the same number of students. Call these 
three groups, in descending order of law grade, X, Y, and Z, respectively. 

NOTE.—Should the examination group not be exactly divisible by three, it will 
be necessary to omit one or two cases taken from the middle of the group, such cases 
to be given no further consideration. 


4. According to the division determined in section 3, sort the examination papers 
into the three sub-groups X, Y, and Z. 

5. Tabulate law grades for each sub-group X, Y, and Z, using class-interval of one 
per cent (Form 2). 

6. Plot separately for each of the sub-groups X, Y, and Z the wrong answers and 
the omissions on each question in the examination (Form 3). 

7. From Form 3 transfer to the respective columns X, Y, Z, and Total on Form 4 
the scores of each question. 

NOTE.—The score is to be calculated by using the same basis as that used by 
the Law School for scoring a student’s examination paper, e.g., in June, 1928 the 
student’s score was given by twice the sum of his wrong answers plus the sum of his 
omitted answers. 


8. Tabulate the average law grades last reported by the Law School for all 
students in the school, using a class-interval of one per cent. Then cumulate 
downward from the high grades (Form 5). 

9. Tabulate the scores on the given examination (sub-groups X, Y, and Z 
combined), using a class-interval of one point. Then cumulate downward from 
the low scores (Form 6). 

10. From Form 2 calculate the arithmetic mean and “sigma” of law grades for 
each sub-group X, Y, and Z. 

11. For each sub-group X, Y, and Z use the following procedure: 

(a) Take the sub-group mean law grade as found in section 10 and find in the 
cumulative distribution column of Form 5 the number of students having 
that or a higher grade. Express this number as a percentage of the total 
number of students, i.e., of N on Form 5. 

(b) Apply this percentage to the number of students in the examination under 
review, i.e., N on Form 6; in the cumulative distribution of Form 6 locate 
the figure so gained and read the corresponding score in the examination 
score column of the same Form. This score is the average standard 
examination score for the group. 





(c) Multiply this average standard examination score by the ratio 

        Number of students in sub-group (X, Y, or Z) 
        -------------------------------------------- 
            Number of questions on examination 

The result is the average standard corrected examination score for the group. 

(d) Multiply this average standard corrected examination score by the ratio 

        sigma (as found in section 10) 
        ------------------------------ 
        mean (as found in section 10) 

The result is the standard corrected score value of the sigma-variation for the 
group. 

(e) Three times this standard corrected score value of the sigma-variation 
(just found in d) added to and subtracted from the average standard 
corrected examination score (found in c) gives the standard range of scores for 
the group. 

12. List the standard range of scores for the three sub-groups on Form 4 and express 
them as ratios, thus: 

    X/Y:  (Upper limit of X)/(Lower limit of Y)  to  (Lower limit of X)/(Upper limit of Y) 
    Y/Z:  (Upper limit of Y)/(Lower limit of Z)  to  (Lower limit of Y)/(Upper limit of Z) 
    X/Z:  (Upper limit of X)/(Lower limit of Z)  to  (Lower limit of X)/(Upper limit of Z) 

13. Grade as N (see section 16 below) those questions which have in each sub-group 
X, Y, and Z scores less than the lower limit of the standard range of scores for 
sub-group X as found in section 11e. 

NOTE.—If a question is so easy that in each sub-group the score is less than 
would normally be expected in even the best sub-group, it is obviously a 
non-discriminating question. If a question is difficult and in all sub-groups only a very 
few men get it right, it is advisable to apply the tests for discrimination given below, 
since it is more in the nature of a difficult question to be discriminating. 

14. For all remaining questions fill in the ratio columns on Form 4 as follows: 
Calculate and enter the X/Z ratios; calculate and enter either the X/Y ratios 
or the Y/Z ratios, whichever has the smaller upper limit in the Standard Ratio 
Ranges (as found in section 12 and listed on Form 4). In calculating these 
ratios consider a zero score as a score of one. X/Y will normally be the second 
ratio used. 

NOTE.—Two ratios for the questions are sufficient for grading purposes, the 
third ratio, when necessary, being determined by the relation between the other 
two, thus: Y/Z equals X/Z divided by X/Y, and X/Y equals X/Z divided by 
Y/Z. Assuming for the moment fairly normal distributions within each 
sub-group, the Y/Z standard ratio range will have a smaller upper limit than the X/Y 
standard ratio range if the mean law grade of sub-group Z is farther below the mean 
of sub-group Y than the mean of sub-group Y is below the mean of sub-group X. 
That is, the Y/Z ratio will be used in preference to the X/Y ratio if the Z group 
shows greater divergence from the Y group than the Y group shows from the X 
group. If the sub-group distributions are highly abnormal so that the 3-sigma ranges 
overlap badly, the system will be applicable only after extensive modification, if at all. 
15. Grade according to the following rules the remaining questions on Form 4, i.e., 
those questions not graded N under section 13. 

(a) Grade as N (see section 16 below): 

(1) Questions with X/Z ratio greater than 1, but the total score of the 
question is less than three times the lower limit of the Standard Range 
of Scores for sub-group X as found in section 11e. 

NOTE.—It is felt unwise to label a question Reversal instead of 
Non-discriminating (or Over-discriminating instead of simply Discriminating) unless the 
question is of such difficulty that the total score for all three sub-groups is at least 
equal to three times the lowest score normally expected in the best sub-group. 

(2) Questions with X/Z ratio equal to 1, or between 1 and the upper limit 
of the X/Z Standard Ratio Range as found in section 12. 

(b) Grade as R (see section 16 below): Questions with X/Z ratio greater than 
1, and the total score of the question is equal to or more than three times 
the lower limit of the standard range of scores for sub-group X as found in 
section 11e. 

(c) Grade as O (see section 16 below): Questions satisfying all the following 
requirements: 

(1) X/Z ratio less than the lower limit of its standard range as found in 
section 12, and 

(2) X/Y ratio (or Y/Z ratio, whichever is used as determined in section 
14) less than the lower limit of its standard range as found in section 12, 
and 

(3) The third ratio no greater than the upper limit of its standard range as 
found in section 12, and 

(4) Total score for the question equal to or more than three times the lower 
limit of the standard range of scores for sub-group X as found in 
section 11e. 

(d) Grade as D (see section 16 below): 

(1) Questions that satisfy requirement (1) for an O question (in paragraph 
c just above) but do not satisfy one or more of the other requirements 
for an O question. 

(2) Questions with the X/Z ratio within the limits of its standard range as 
found in section 12. 

16. Definitions of classes of judgment: 

N. Questions that do not discriminate between sub-groups. Each sub-group 
tends to react similarly to such questions, making the questions of no value as a 
test of comparative ability between higher and lower sub-groups. 

R. Questions which discriminate inversely between sub-groups. In such cases 
students with the higher law grades obtain poorer results than do students with 
the lower law grades. 

O. Questions that discriminate between sub-groups to a greater extent than 
is justified by the relative levels of law grades. 

D. Questions that make normal discrimination between sub-groups. 
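
The grading rules of sections 13 to 15 can be sketched as a single function. This is one possible reading of the rules rather than the authors' own tabulating routine: the names `ratio_limits` and `grade_question` are hypothetical, and ties between the two upper limits are resolved in favor of X/Y on the strength of the remark that X/Y will normally be the second ratio. The ranges and scores in the example are those shown on Form 4.

```python
def ratio_limits(range_a, range_b):
    """Lower and upper limit of the standard ratio range A/B (section 12)."""
    low_a, high_a = range_a
    low_b, high_b = range_b
    return low_a / high_b, high_a / low_b


def grade_question(x, y, z, ranges):
    """Grade one question N, R, O, or D from its sub-group scores.

    ranges maps "X", "Y", "Z" to the (lower, upper) standard range of scores.
    """
    low_x = ranges["X"][0]
    hard_enough = (x + y + z) >= 3 * low_x

    # Section 13: too easy for every sub-group.
    if x < low_x and y < low_x and z < low_x:
        return "N"

    # Section 14: a zero score counts as one when forming ratios.
    xs, ys, zs = (max(score, 1) for score in (x, y, z))
    xz = xs / zs
    xz_low, xz_high = ratio_limits(ranges["X"], ranges["Z"])
    xy_limits = ratio_limits(ranges["X"], ranges["Y"])
    yz_limits = ratio_limits(ranges["Y"], ranges["Z"])
    # The second ratio is the one whose standard range has the smaller upper limit.
    if xy_limits[1] <= yz_limits[1]:
        second, second_limits, third, third_limits = xs / ys, xy_limits, ys / zs, yz_limits
    else:
        second, second_limits, third, third_limits = ys / zs, yz_limits, xs / ys, xy_limits

    # Section 15 (a)(1) and (b): X/Z ratio greater than 1.
    if xz > 1:
        return "R" if hard_enough else "N"
    # Section 15 (a)(2): equal to 1, or between 1 and the upper limit.
    if xz > xz_high:
        return "N"
    # Section 15 (d)(2): within the standard range.
    if xz >= xz_low:
        return "D"
    # Section 15 (c), otherwise (d)(1).
    over = second < second_limits[0] and third <= third_limits[1] and hard_enough
    return "O" if over else "D"


form_4_ranges = {"X": (10, 13), "Y": (16, 19), "Z": (21, 24)}
form_4_scores = {1: (0, 3, 4), 2: (26, 23, 30), 3: (31, 39, 33),
                 4: (10, 20, 23), 5: (29, 31, 39), 6: (14, 8, 8)}
grades = {q: grade_question(*s, form_4_ranges) for q, s in form_4_scores.items()}
print(grades)
```

Applied to the six questions of Form 4 the sketch returns the grades entered there: N, N, N, D, N, and R.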
FORM 1.—DISTRIBUTION PLOT OF YEAR LAW GRADES FOR STUDENTS TAKING 
TRUE-FALSE EXAMINATION IN .........., JUNE 1928 

[Plot: one row for each whole law grade, 84 down to 62; one column for each 
tenth of a grade, .0 to .9, and a Total column. The entries in the tenths 
columns are not recoverable; the row totals follow.] 

Law grade   Total
84            1
83            2
82            1
81            1
80            1
79            1
78            7
77            3
76            4
75            4
74            2
73            5
72            5
71            6
70            5*
69            9
68            7
67           11
66            5
65            6
64            4
63            3
62            2
Total        95

N.B.—The solid lines mark the division into three groups X, Y, and Z, each 
containing 31 cases. 
* The two cases at the middle of the distribution which are omitted from 
all further consideration, making the whole group exactly divisible into three 
sub-groups. 
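
The division into three equal sub-groups, with one or two middle cases set aside, might be sketched as below. The function name `split_into_thirds` is hypothetical, and the choice of exactly which middle cases to drop is an assumption; the procedure says only that they are taken from the middle of the group.

```python
def split_into_thirds(grades):
    """Return sub-groups X, Y, Z in descending order of law grade, equal in size."""
    ordered = sorted(grades, reverse=True)
    extra = len(ordered) % 3
    if extra:
        # Omit one or two cases from the middle of the whole group.
        start = (len(ordered) - extra) // 2
        del ordered[start:start + extra]
    size = len(ordered) // 3
    return ordered[:size], ordered[size:2 * size], ordered[2 * size:]


# Hypothetical grades for 95 students, as in the group of Form 1.
grades = list(range(95))
group_x, group_y, group_z = split_into_thirds(grades)
print(len(group_x), len(group_y), len(group_z))
```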
FORM 2.—GROUP DISTRIBUTIONS OF YEAR LAW GRADES FOR STUDENTS TAKING 
TRUE-FALSE EXAMINATION IN .........., JUNE 1928 

      Group X                Group Y                Group Z
Grade    Number of     Grade    Number of     Grade    Number of
         students               students               students
84           1         73           1         67          11
83           2         72           5         66           5
82           1         71           6         65           6
81           1         70           3         64           4
80           1         69           9         63           3
79           1         68           7         62           2
78           7
77           3
76           4
75           4
74           2
73           4
Total       31         Total       31         Total       31
FORM 3.—CHECK LIST, BY QUESTIONS, OF NUMBER OF STUDENTS IN EACH 
SUB-GROUP X, Y, AND Z ANSWERING WRONG OR OMITTING ANSWER, ON 
TRUE-FALSE EXAMINATION IN .........., JUNE 1928 

                Group X             Group Y             Group Z
Question    Wrong    Omitted    Wrong    Omitted    Wrong    Omitted
number      answer   answer     answer   answer     answer   answer
1             0         0
2            12         2
3            14         3            (similarly)
4             5         0
5            11         7
6             3         8
Etc.         Etc.
FORM 4.—JUDGING CRITERIA, SCORES, RATIOS, AND QUESTION GRADES FOR 
TRUE-FALSE EXAMINATION IN .........., JUNE 1928 

Sub-group   Standard range   Standard ratio    Score criterion
            of scores        ranges
X           10-13            X/Y, .81-.53      For N: Each sub-group score 9 or less.
Y           16-19            Y/Z, .90-.67      For R and O: Total score 30 or more.
Z           21-24            X/Z, .62-.42

                     Score                  Ratio
Question     X     Y     Z    Total     X/Y      X/Z     Question grade
1            0     3     4      7       ..       ..      N
2           26    23    30     79      1.13      .87     N
3           31    39    33    103       .79      .94     N
4           10    20    23     53       .50      .44     D
5           29    31    39     99       .94      .74     N
6           14     8     8     30      1.75     1.75     R
Etc.       Etc.





























FORM 5.—SIMPLE AND CUMULATIVE DISTRIBUTIONS OF YEAR LAW GRADES FOR 
ALL STUDENTS IN THE SCHOOL, JUNE 1928 
(These Figures Are Hypothetical) 

                  Number of students
Grade    Simple distribution    Cumulative distribution
90                1                       1
89                0                       1
88                4                       5
87                2                       7
86               10                      17
85                3                      20
..               ..                      ..
62               ..                     284
61                4                     288
60                7                     295
59                2                     297
58                3                     300

N               300





Applying section 11: Suppose mean law grade for a Group X is 85; it is found 
here that 6.7 per cent of the total men in the school had a law grade of 85 or better. 
FORM 6.—SIMPLE AND CUMULATIVE DISTRIBUTIONS OF EXAMINATION SCORES ON 
TRUE-FALSE EXAMINATION IN .........., JUNE 1928 
(These Figures Are Hypothetical) 

[Table: columns are Score, Simple distribution, and Cumulative distribution 
(number of students). The scores listed begin 20, 21, 22, 23, 24; the 
cumulative distribution closes with 90, 90, 91, 93. The remaining entries are 
not recoverable.] 

Applying section 11: The figure determined for Group X from Form 5 was 6.7 
per cent; 6.7 per cent of the total men (93) taking this particular examination is 6; 
it is seen that the examination score of 23 was attained or bettered by 6 men; 23 is 
the average standard examination score of Group X. 
THE SIGMAS OF COMBINED DISTRIBUTIONS 
CALCULATED FROM SIGMAS, MEANS, AND 
FREQUENCIES OF COMPONENT DISTRIBUTIONS¹ 


C. R. GARVEY 


National Scholar in Child Development, University of Minnesota 


It is sometimes necessary to compute for a series of measure- 
ments, not only the mean and standard deviation of the series as a whole, 
but also the means and sigmas of several fractional distributions, 
components of the total or general distribution. The labor of calcu- 
lation is thus doubled. If it is necessary to fractionate the data on 
more than one basis, i.e., into more than one system of fractional 
distributions, the labor involved is multiplied by the number of such 
independent systems of fractionation plus 1 (the general system of 
parameters). 

Obviously, this increase in labor can be kept down in the case 
of the means, by finding the smallest fractional means first and using 
them to compute weighted means for the others. Where $x$ = each 
measure and $\bar{x}$ = the mean of $N$ such measures, the formula would 
be 

\[ \bar{x}_{1,2} = \frac{(\bar{x}_1)N_1 + (\bar{x}_2)N_2}{N_1 + N_2} \tag{1} \]

in which subscript 1,2 refers to a distribution composed of distributions 
1 and 2. In practice, one operation is saved by using the form 

\[ \bar{x}_{1,2} = \frac{Sx_1 + Sx_2}{N_1 + N_2} \tag{2} \]

in which $S$ indicates summation. 

This procedure is familiar to everyone. The present purpose is 
to extend it to the calculation of the standard deviations. The writer 
wishes to claim no priority for the following original formula, its 
essential features having been given in a similar one by Yule many 
years ago. But since a survey of such frequently encountered texts 
as those of Kelley, Garrett, Thurstone, Fisher, Rugg, and Pearl 
indicates that the method is not in general use, it is thought worth 
while to present it here. 











¹ The writer wishes to thank Dr. Florence L. Goodenough and Phillip J. 
Rulon for reading the manuscript, without imposing any responsibility upon either 
of them. 
Where $\Delta = x - \bar{x}$, 

\[ \sigma = \sqrt{\frac{S\Delta^2}{N}} \]

When $d = x - a$ and $a$ = arbitrary mean, 

\[ \sigma = \sqrt{\frac{Sd^2}{N} - \left(\frac{Sd}{N}\right)^2} \]

When $a = 0$, then $d = x$, and 

\[ \sigma = \sqrt{\frac{Sx^2}{N} - \left(\frac{Sx}{N}\right)^2} \]

but 

\[ \frac{Sx}{N} = \bar{x} \]

so that 

\[ \sigma = \sqrt{\frac{Sx^2}{N} - \bar{x}^2} \tag{3} \]

and 

\[ \sigma^2 = \frac{Sx^2}{N} - \bar{x}^2 \]

This is essentially the formula given to his students by the late James 
[...] Beardsley Ruml in 1916.² 

Transposing and assigning subscripts designating fractional 
distributions, we have 

\[ \frac{Sx_1^2}{N_1} - \bar{x}_1^2 = \sigma_1^2 \quad \text{and} \quad \frac{Sx_2^2}{N_2} - \bar{x}_2^2 = \sigma_2^2 \tag{4} \]


¹ The Arithmetic of the Product Moment Method of Calculating the Coefficient 
of Correlation. American Naturalist, Vol. XLIV, 1910, pp. 693-699; especially 
695f. 

² On the Computation of the Standard Deviation. Psychological Bulletin, 
Vol. XVIII, 1916, pp. 444-446. 
By summation and from equations (1) and (2) we have 

\[ \frac{Sx_1^2 + Sx_2^2}{N_1 + N_2} - \bar{x}_{1,2}^2 = \sigma_{1,2}^2 \tag{5} \]

In machine calculation a table is set up, in which the rows are 
numbered to correspond to the subscripts in the formulae, and which has 
the following eight column headings: 

  1      2      3        4                    5            6             7                                    8
  $Sx$   $N$    $Sx^2$   $Sx/N = \bar{x}$     $\bar{x}^2$  $Sx^2/N$      $Sx^2/N - \bar{x}^2 = \sigma^2$      $\sigma$





Divide the data into fractional distributions of the greatest com- 
mon denominator, so that any desired combined distribution can be 
made up from these fractional distributions without further sub- 
division. Sum all the measures in the first small distribution and 
place the sum opposite subscript 1 in column 1. Place the number of 
measures in column 2. Square each measure (from a table), sum the 
squares, and place this sum in column 3. The rest of the table is 
self-explanatory, each subsequent entry being derived from previous 
entries. Completion of a row of entries to column 8 fulfills equation 
(3). The mean and sigma of a combined distribution can be obtained 
by summing columns 1, 2, and 3, and calculating the entries for sub- 
sequent columns from these sums. Any simultaneous portion or 
portions of these first three columns can be summed, and thus any 
desired distribution can be combined from the appropriate fractional 
distributions. This fulfills equation (5) for any desired combination 
of subscripts. 
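
The tabular procedure can be sketched as follows, with each row holding only the first three columns (Sx, N, and the sum of squares) and every later column derived from them. The function names and the two small sets of measures are hypothetical; the comparison against `statistics.pstdev` is included only as a check on the sketch.

```python
import math
import statistics


def sums(measures):
    """Columns 1 to 3 of the table: Sx, N, and the sum of the squared measures."""
    return sum(measures), len(measures), sum(m * m for m in measures)


def mean_and_sigma(sum_x, n, sum_x2):
    """Remaining columns of a row, as in equation (3)."""
    mean = sum_x / n
    variance = sum_x2 / n - mean ** 2
    # Guard against a tiny negative value produced by rounding.
    return mean, math.sqrt(max(variance, 0.0))


def combine(rows):
    """Sum columns 1, 2, and 3 over the chosen rows, then finish the row: equation (5)."""
    sum_x = sum(r[0] for r in rows)
    n = sum(r[1] for r in rows)
    sum_x2 = sum(r[2] for r in rows)
    return mean_and_sigma(sum_x, n, sum_x2)


first = [1, 2, 3, 4]        # hypothetical fractional distribution 1
second = [10, 12, 14]       # hypothetical fractional distribution 2
mean, sigma = combine([sums(first), sums(second)])
direct_sigma = statistics.pstdev(first + second)
print(round(mean, 4), round(sigma, 4), round(direct_sigma, 4))
```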

In reviewing literature or in other cases where $Sx^2$ is not given we 
need a formula using sigma instead. From equation (4) 

\[ \frac{Sx_1^2}{N_1} - \bar{x}_1^2 = \sigma_1^2 \]

transposing and multiplying by $N_1$ gives 

\[ Sx_1^2 = (\sigma_1^2 + \bar{x}_1^2)N_1 \]

similarly for $Sx_2^2$. Then equation (5) becomes 

\[ \frac{(\sigma_1^2 + \bar{x}_1^2)N_1 + (\sigma_2^2 + \bar{x}_2^2)N_2}{N_1 + N_2} - \bar{x}_{1,2}^2 = \sigma_{1,2}^2 \tag{6} \]

Substituting the value of $\bar{x}_{1,2}$ from equation (1), and extending to 
include $n$ fractional distributions, we have 

\[ \frac{(\sigma_1^2 + \bar{x}_1^2)N_1 + (\sigma_2^2 + \bar{x}_2^2)N_2 + \cdots + (\sigma_n^2 + \bar{x}_n^2)N_n}{N_1 + N_2 + \cdots + N_n} - \left[ \frac{(\bar{x}_1)N_1 + (\bar{x}_2)N_2 + \cdots + (\bar{x}_n)N_n}{N_1 + N_2 + \cdots + N_n} \right]^2 = \sigma_{1,2,\ldots,n}^2 \]

This is comparable with Yule’s equation (7),¹ 

\[ N\sigma^2 = \Sigma(N_m \sigma_m^2) + \Sigma(N_m d_m^2), \]

except that here, equation (6), we use the fractional means themselves, 
whereas Yule uses the deviations $d_1, d_2, \ldots$ of these means from the general 
mean $\bar{x}_{1,2,\ldots,n}$. Where these means are small they may as well be 
used directly. Where they are extremely large, the deviations are more 
easily used. Experimenters should always report $N$ along with $\bar{x}$ and 
sigma, so that their data can be included in a summary, and so that 
reviewers can actually review the author’s work, instead of being 
limited to merely quoting the author’s conclusions. Of course, if a 
reviewer combines measurements made by separate authors, he must 
examine the comparability of the conditions under which the separate 
sets of measurements were taken, just as he must in case of sets of 
data taken by the same author or by the reviewer himself. 
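
For the reviewer's case, where only the reported N, mean, and sigma of each fractional distribution are at hand, equation (6) extended to any number of distributions might be sketched as below. The function name `combined_mean_and_sigma` and the two small distributions are hypothetical.

```python
import math


def combined_mean_and_sigma(means, sigmas, counts):
    """Mean and sigma of a combined distribution from the means, sigmas, and N's
    of its component distributions."""
    total_n = sum(counts)
    # Weighted mean of the fractional means, as in equation (1).
    mean = sum(m * n for m, n in zip(means, counts)) / total_n
    # Each (sigma squared + mean squared) * N restores that distribution's sum of squares.
    mean_square = sum((s * s + m * m) * n
                      for m, s, n in zip(means, sigmas, counts)) / total_n
    return mean, math.sqrt(max(mean_square - mean * mean, 0.0))


# Hypothetical reports: measures 1 and 3 (mean 2, sigma 1, N 2) and four measures of 5.
mean, sigma = combined_mean_and_sigma(means=[2.0, 5.0], sigmas=[1.0, 0.0], counts=[2, 4])
print(round(mean, 4), round(sigma, 4))
```

The six underlying measures (1, 3, 5, 5, 5, 5) have a mean of 4 and a sum of squared deviations of 14, so the sigma found directly is the square root of 14/6, which the sketch reproduces.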











1 “An Introduction to the Theory of Statistics,” 9th ed., London, 1929, p. 142. 
A NOTE ON THE [illegible] OF THE HARMONIC 
MEAN 


EUGENE SHEN 


Kwang Hua University, Shanghai 


The writer has found that students completing an introductory 
course in statistical method often have a very hazy notion of the 
meaning of the harmonic mean. While they can make computations 
according to the formula, and even remember that its application 
has to do with the calculation of average rates, they seldom appreciate 
the problem in its comprehensive setting. They often fail to see that 
the correct use of the formula depends upon the joint operation of 
two factors: The way in which events take place and the way in which 
records are made. 

As the harmonic mean in educational psychology usually relates to 
the question of time and work, we shall take an illustration in this 
field. A group of students competing in addition may be required 
either to work during a uniform amount of time or to finish a uniform 
amount of work. Usually, records are made of the amount of work 
finished in the given time, or of the amount of time necessary for the 
given amount of work. The arithmetic mean in the two cases would 
correctly give the average work per unit time and the average time per 
unit of work, respectively. For purposes of comparison, we can use 
either one with the reciprocal of the other. The harmonic mean is not 
called for. 

Sometimes, however, data are given in terms of time per unit work 
when they are derived from a constant amount of time, or on the other 
hand, in terms of work per unit time when they are derived from a con- 
stant amount of work. In both cases, the arithmetic mean would be 
incorrect, and the harmonic mean should be used instead. 

The writer has proposed a definition of the harmonic mean as a 
special case of the weighted arithmetic mean where the weights are 
equal to the reciprocals of the measures.² The true average height of 
man is not given by an unweighted arithmetic mean of the heights 





¹ If in the foregoing we substitute price for rate, money for work, and commodity 
for time, we have a problem most frequently found in the field of business and 
economics. 

² “The Foundations of Experimental Psychology,” edited by Carl Murchison. 
Clark University Press, 1929, p. 839. 
of different groups by race, residence, or political affiliation, because the 
groups vary in size and should be given different weights in proportion 
to the population in each group. Similarly the unweighted 
arithmetic mean of a series of rates for a uniform amount of work is incorrect 
because the varying rates do not operate for the same length of time, 
and need to be weighted accordingly. Clearly the length of time for 
which each rate operated is proportional to its reciprocal, and therefore 
the correct mean is derived by weighting each rate according to its 
reciprocal. This is precisely what the harmonic mean is. 

It is the opinion of the writer that the proposed definition leads to a 
more systematic and more comprehensive conception of the harmonic 
mean. He has found it of valuable assistance in clarifying the mind 
of such students as are apt to be refractory to other methods of 
approach. Mathematically it is of course equivalent and reducible 
to the usual definition, as the reciprocal of the arithmetic mean of the 
reciprocals of the measures: 


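$$H = \frac{\sum \left(x \cdot \frac{1}{x}\right)}{\sum \frac{1}{x}} = \frac{n}{\sum \frac{1}{x}} = \frac{1}{\frac{1}{n}\sum \frac{1}{x}}$$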
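
The equivalence of the two definitions can be checked numerically. The following is a minimal sketch; the function names and the sample measures are illustrative assumptions, not taken from the writer's text.

```python
def weighted_arithmetic_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def harmonic_mean_usual(values):
    """Usual definition: reciprocal of the arithmetic mean of the reciprocals."""
    n = len(values)
    return 1.0 / (sum(1.0 / x for x in values) / n)


def harmonic_mean_weighted(values):
    """Proposed definition: arithmetic mean weighted by the reciprocals."""
    reciprocal_weights = [1.0 / x for x in values]
    return weighted_arithmetic_mean(values, reciprocal_weights)


measures = [4.0, 5.0, 20.0]  # hypothetical positive measures

usual = harmonic_mean_usual(measures)
weighted = harmonic_mean_weighted(measures)
print(usual, weighted)
```

Each product of a measure and its reciprocal weight is 1, so the numerator of the weighted mean is simply the number of measures, which is why the two functions agree.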
NOTE ON THE STANDARD ERRORS OF THE 
STANDARD ERRORS OF ESTIMATE AND 
MEASUREMENT 


CHESTER E. KELLOGG AND KENNETH W. SPENCE 


McGill University 


1 
In the course of research on the reliability of the high-relief finger 
maze, recently completed by the junior author of this note, and to be 
published soon as part of a comprehensive study, occasion arose to 
compare the standard errors of measurement of intelligence tests and 
various methods of scoring maze records. In order to have a check 
on the validity of the conclusions drawn, we derived formulas for the 
standard errors of these and related measures. 

It might naturally be supposed that the standard error of a standard 
error of measurement could be determined, as in the case of an 
ordinary standard deviation, by dividing by $(2n)^{1/2}$. In the case of the 
standard error of estimate, $SD_{est} = SD(1 - r^2)^{1/2}$, the corresponding 
formula does hold good. For taking differentials, we have: 


$$dSD_{est} = (1 - r^2)^{1/2}\, dSD - \frac{SD\, r\, dr}{(1 - r^2)^{1/2}}$$


Squaring, summing, and dividing through by n, 


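$$SD^2_{SD_{est}} = (1 - r^2)\, SD^2_{SD} + \frac{SD^2 r^2\, SD_r^2}{1 - r^2} - \frac{2\, SD\, r\, (1 - r^2)^{1/2} \times r_{SD \cdot r} \times SD_r \times SD_{SD}}{(1 - r^2)^{1/2}}$$
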
Assuming approximate normality, and using formulas 32a, 108b, and 
125a from Kelley’s “Statistical Method,” this becomes: 


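$$SD^2_{SD_{est}} = \frac{(1 - r^2)\, SD^2}{2n} + \frac{SD^2 r^2 (1 - r^2)^2}{(1 - r^2)\, n} - \frac{2\, SD \times r \times r \times (1 - r^2) \times SD}{2^{1/2}\, n^{1/2}\, (2n)^{1/2}}$$

which reduces to $(1 - r^2)SD^2/2n$. 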
Accordingly, 


$$SD_{SD_{est}} = \frac{SD(1 - r^2)^{1/2}}{(2n)^{1/2}} = \frac{SD_{est}}{(2n)^{1/2}}$$
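
The reduction for the standard error of estimate can be checked numerically. The sketch below is ours, not the authors': it assumes the three normal-theory substitutions that the derivation implies, namely a standard error of SD equal to SD/(2n)^{1/2}, a standard error of r equal to (1 - r^2)/n^{1/2}, and a correlation of r/2^{1/2} between SD and r. The function names are illustrative.

```python
import math


def var_sd_est_by_terms(sd, r, n):
    """Sum the three terms of the squared differential of SD_est.

    Assumed substitutions (our reading of the Kelley formulas cited):
      SD_SD  = SD / sqrt(2n)        standard error of a standard deviation
      SD_r   = (1 - r^2) / sqrt(n)  standard error of a correlation
      r_SD.r = r / sqrt(2)          correlation between SD and r
    """
    sd_sd = sd / math.sqrt(2 * n)
    sd_r = (1 - r**2) / math.sqrt(n)
    r_sd_r = r / math.sqrt(2)

    term_sd = (1 - r**2) * sd_sd**2
    term_r = sd**2 * r**2 * sd_r**2 / (1 - r**2)
    cross = -2 * sd * r * r_sd_r * sd_sd * sd_r
    return term_sd + term_r + cross


def var_sd_est_closed_form(sd, r, n):
    """Closed form stated in the text: (1 - r^2) SD^2 / 2n."""
    return (1 - r**2) * sd**2 / (2 * n)


for r in (0.10, 0.40, 0.65, 0.90):
    print(r, var_sd_est_by_terms(15.0, r, 50), var_sd_est_closed_form(15.0, r, 50))
```

The second and third terms cancel for every value of r, which is why the simple division by (2n)^{1/2} holds for the standard error of estimate.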

Similarly, for the standard error of measurement, 
$SD_{meas} = SD(1 - r)^{1/2}$, we have, taking differentials 

$$dSD_{meas} = (1 - r)^{1/2}\, dSD - \frac{SD\, dr}{2(1 - r)^{1/2}}$$
Squaring, summing, and dividing through by n, 

$$SD^2_{SD_{meas}} = (1 - r)\, SD^2_{SD} + \frac{SD^2\, SD_r^2}{4(1 - r)} - \frac{2\, SD\, (1 - r)^{1/2} \times r_{SD \cdot r} \times SD_r \times SD_{SD}}{2(1 - r)^{1/2}}$$

$$= \frac{(1 - r)\, SD^2}{2n} + \frac{SD^2 (1 - r^2)^2}{4n(1 - r)} - \frac{SD \times r \times (1 - r^2) \times SD}{2^{1/2}\, n^{1/2}\, (2n)^{1/2}}$$

which reduces to $SD^2(1 - r)(3 - r^2)/4n$. 
Accordingly, 
$$SD_{SD_{meas}} = \frac{SD(1 - r)^{1/2}(3 - r^2)^{1/2}}{(4n)^{1/2}} = \frac{SD_{meas}(3 - r^2)^{1/2}}{(4n)^{1/2}}$$
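
The corresponding check for the standard error of measurement is sketched below under the same assumed substitutions (standard error of SD equal to SD/(2n)^{1/2}, standard error of r equal to (1 - r^2)/n^{1/2}, and a correlation of r/2^{1/2} between SD and r). The function names are ours and purely illustrative.

```python
import math


def var_sd_meas_by_terms(sd, r, n):
    """Sum the three terms of the squared differential of SD_meas.

    The three substitutions below are assumptions read off the derivation.
    """
    sd_sd = sd / math.sqrt(2 * n)        # standard error of SD
    sd_r = (1 - r**2) / math.sqrt(n)     # standard error of r
    r_sd_r = r / math.sqrt(2)            # correlation between SD and r

    term_sd = (1 - r) * sd_sd**2
    term_r = sd**2 * sd_r**2 / (4 * (1 - r))
    # 2 * (1 - r)^(1/2) * SD / (2 (1 - r)^(1/2)) simplifies to SD.
    cross = -sd * r_sd_r * sd_sd * sd_r
    return term_sd + term_r + cross


def se_of_sd_meas(sd, r, n):
    """Closed form: SD_meas * (3 - r^2)^(1/2) / (4n)^(1/2)."""
    sd_meas = sd * math.sqrt(1 - r)
    return sd_meas * math.sqrt(3 - r**2) / math.sqrt(4 * n)


for r in (0.40, 0.80, 0.95):
    print(r, math.sqrt(var_sd_meas_by_terms(15.0, r, 50)), se_of_sd_meas(15.0, r, 50))
```

Unlike the case of the standard error of estimate, the terms involving r do not cancel here, so the multiplier depends on r as well as on n.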
To facilitate calculation of the $SD_{SD_{meas}}$, we have tabulated 
representative values of $(3 - r^2)^{1/2}/(4n)^{1/2}$. 

















 n \ r | .00   | .10   | .40   | .65   | .80   | .90   | .95   | .99
  25   | .1732 | .1729 | .1685 | .1605 | .1536 | .1480 | .1448 | .1421
  30   | .1581 | .1578 | .1538 | .1466 | .1402 | .1351 | .1322 | .1297
  35   | .1464 | .1461 | .1424 | .1357 | .1298 | .1250 | .1224 | .1201
  40   | .1369 | .1367 | .1332 | .1269 | .1214 | .1170 | .1145 | .1123
  45   | .1291 | .1289 | .1256 | .1197 | .1145 | .1103 | .1080 | .1059
  50   | .1225 | .1223 | .1191 | .1135 | .1086 | .1046 | .1024 | .1005
  55   | .1168 | .1166 | .1136 | .1082 | .1036 | .0998 | .0977 | .0958
  60   | .1118 | .1116 | .1088 | .1036 | .0991 | .0955 | .0935 | .0917
  65   | .1074 | .1072 | .1045 | .0996 | .0953 | .0918 | .0898 | .0881
  70   | .1035 | .1033 | .1007 | .0959 | .0918 | .0884 | .0866 | .0849
  75   | .1000 | .0998 | .0973 | .0927 | .0887 | .0854 | .0836 | .0821
  85   | .0939 | .0938 | .0914 | .0871 | .0833 | .0802 | .0785 | .0771
 100   | .0866 | .0865 | .0843 | .0803 | .0768 | .0740 | .0724 | .0710
 150   | .0707 | .0706 | .0689 | .0655 | .0627 | .0604 | .0591 | .0580
 200   | .0612 | .0611 | .0596 | .0568 | .0543 | .0523 | .0512 | .0502
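
For values of r and n that are not tabulated, the multiplier can be computed directly. The sketch below is a minimal illustration with names of our own choosing (`table_factor`, `build_table`, `se_of_sd_meas_from_table`); it evaluates (3 - r^2)^{1/2}/(4n)^{1/2} and applies it to a given standard error of measurement.

```python
import math

# Values of r and n used as column and row headings in the table.
R_VALUES = [0.00, 0.10, 0.40, 0.65, 0.80, 0.90, 0.95, 0.99]
N_VALUES = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 85, 100, 150, 200]


def table_factor(r, n):
    """Multiplier (3 - r^2)^(1/2) / (4n)^(1/2)."""
    return math.sqrt(3 - r**2) / math.sqrt(4 * n)


def build_table():
    """Return {n: [factor for each r in R_VALUES]}, rounded to four places."""
    return {n: [round(table_factor(r, n), 4) for r in R_VALUES] for n in N_VALUES}


def se_of_sd_meas_from_table(sd_meas, r, n):
    """Standard error of a standard error of measurement."""
    return sd_meas * table_factor(r, n)


for n, row in build_table().items():
    print(n, " | ".join(f"{value:.4f}" for value in row))
```

As an example of use, a hypothetical standard error of measurement of 4.0 points, with r = .80 and n = 100, has a standard error of about 4.0 × .0768, or roughly 0.31 points.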





























Although it is not likely to be much in demand at present, we have 
also derived a formula for the standard error of the standard error of 
estimate of true score, $SD_{\infty \cdot 1} = SD_1(r_{1I} - r_{1I}^2)^{1/2}$. (Kelley, formula 
169.) 


$$dSD_{\infty \cdot 1} = \frac{SD(dr - 2r\,dr)}{2(r - r^2)^{1/2}} + (r - r^2)^{1/2}\, dSD.$$










Squaring, summing, and dividing through by n, 


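$$SD^2_{SD_{\infty \cdot 1}} = \frac{SD^2 (1 - 2r)^2\, SD_r^2}{4(r - r^2)} + (r - r^2)\, SD^2_{SD} + \frac{2\, SD\, (r - r^2)^{1/2} (r_{SD \cdot r}\, SD_r\, SD_{SD} - 2r\, r_{SD \cdot r}\, SD_r\, SD_{SD})}{2(r - r^2)^{1/2}}$$

$$= \frac{SD^2 (1 - 4r + 5r^2 + r^3 - 4r^4 - r^5 + 2r^6)}{4n(r - r^2)}$$

Accordingly, 

$$SD_{SD_{\infty \cdot 1}} = \frac{SD(r - r^2)^{1/2}(1 - 3r + 2r^2 + 3r^3 - r^4 - 2r^5)^{1/2}}{(4nr(r - r^2))^{1/2}} = \frac{SD_{\infty \cdot 1}(1 - 3r + 2r^2 + 3r^3 - r^4 - 2r^5)^{1/2}}{(4nr(r - r^2))^{1/2}}$$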
NEW PUBLICATIONS IN EDUCATIONAL 
PSYCHOLOGY AND RELATED FIELDS OF 
EDUCATION 


CONDUCTED BY FRANCES M. FOSTER 











Educational Psychology, by Monroe, DeVoss and Reagan. New York: 
Doubleday Doran, 1930. Pp. XIII + 607. 


Problems in Educational Psychology, by Gifford and Shorts. New 
York: Doubleday Doran, 1930. Pp. XIV + 728. 


These volumes are two from the excellently planned Teacher 
Training Series, edited by W. S. Monroe, one of the authors of the 
Educational Psychology. The second text listed above was designed to 
supply companion readings to any basal text in educational psy- 
chology, but the authors acknowledge that they had the Monroe, 
DeVoss and Reagan text principally in mind when they planned 
their work. 

The Educational Psychology is a well-constructed book exhibiting 
many excellent features. In the selection of the subject-matter a 
wise catholicity has been shown. The topics included are—The 
Physical Mechanism; Human Responses to Stimuli; The Learning 
Process; Learning in School Activities; Transfer of Training; Intelli- 
gence and its Measurement; Measurement of Achievement; Indi- 
vidual Differences; Characteristics of Children at Different Pedagogical 
Levels; The Psychology of Elementary-School Subjects; The Psy- 
chology of High-School Subjects; Mental Hygiene; and How to Study 
Pupils. Such a list can be made strong meat and in their efforts to 
keep the text at an elementary level a few of the topics necessarily 
become a little obscure. The “limit of improvement” is an illustration 
of this. Apparently the authors take the position that limits exist 
for manual habits only, while the acquisition of knowledge may 
proceed with ever-increasing facility, which, of course, is a 
misinterpretation. The authors have also hunted with the hounds and 
run with the hares in respect to subjective and objective observations 
although, as scientists, they have emphasized throughout their volume 
the data obtained from carefully controlled experiments. The best 
feature of the book is the consistency with which technical terms have 
been used. When, for example, they wish to speak of intelligence 
derived from the use of a test as distinguished from intelligence as a 
theoretical concept they use “intelligence as measured” and thus get 
rid of any ambiguity. The Learning Exercises for the reader at the 
end of each chapter are truly exercises to promote further learning 
and are commendable. 

The Problems in Educational Psychology is an anthology of excerpts 
or readings from works of 192 authors. Thorndike and Woodworth 
are quoted the most frequently, and justly so. The selections appear 
to have been made more carefully than those of previous compilers 
such as Skinner, Gast and Skinner. By listing in each chapter 
“Suggested Problems” and “Supplementary Learning Exercises” 
some attempt to justify the selected title has been made. 

Both volumes are well printed and strongly bound, and both are 
remarkably free from typographical errors. The reviewer wishes 
them the success they deserve. P. SANDIFORD. 

University of Toronto. 





Minnesota Mechanical Ability Tests, by D. G. Paterson, R. M. Elliott, 
L. D. Anderson, H. A. Toops, and E. Heidbreder. Minneapolis: 
University of Minnesota Press, 1930. Pp. XXII + 526. 


It is a genuine pleasure to study the methods—so thorough, so 
cautious, and often inventive—which characterize this excellent and 
important investigation. Within recent years there have been 
published few other studies of similar thoroughness, so insistent on the 
accuracy of the instruments of measurement, so critical of their 
validity. There is reason to think that with this type of investigation 
(one would like to name those few others of equally fine caliber!) the 
measurement of human behavior has, within the last few years, 
achieved one more significant step in its advance towards recognition 
as an exact science. In the field of mechanical ability undoubtedly this 
particular work is fundamental. 

The authors set out to investigate the field of mechanical ability 
to determine adequate answers to the two main questions: (1) Is 
“mechanical ability” one ability or many? and (2) How is it related to 
other traits such as verbal intelligence and motor ability? 

By force of circumstances, similar to those experienced by investi- 
gators in the field of measurement of verbal intelligence, it became 
necessary to define mechanical ability as “that which enables a person 
to succeed in a definitely restricted range of vocational and trade 
school courses.” 

A survey of literature in the field yielded twenty-four tests relating 
to mechanical ability. These were administered in a preliminary 
investigation to two hundred seventeen boys, all taking shop-work, in 
Grades VII and VIII in a junior high school. Nearly all of the tests 
were readministered after a suitable period of time, thus yielding a 
measure of reliability. 

The criterion selected was shop grades, determined as objectively 
as possible in terms of quality of work, quantity of work, and informa- 
tion displayed in examinations. 

The scores on every test were then correlated with (1) the criterion 
scores and (2) the average scores on two verbal intelligence tests. 
Seven tests, yielding the highest correlation coefficients with the 
combined criterion, were selected to comprise the final battery 


(Rincon: vate. = -973 Rerie svar. = 593 Rerie+ inven : rate. = 61). 


The reliability coefficients of these seven tests ranged, however, from 
.65 to .80. By lengthening some, and imposing a time limit on 
others, the coefficients were raised, theoretically at least, to range from 
.80 to .93, expectations closely approximated in the final experiment. 

In the experiment proper this battery (now known as the Minnesota 
Mechanical Ability Tests) was administered to one hundred fifty 
incoming boys in the same grades at the same school. In addition, 
thirty-six other measures were determined, covering academic success, 
previous mechanical experience, interests, motor ability, anthropo- 
metric measures, social and economic status, and home influences. 

The validity of the criterion was determined on the basis of objec- 
tive standards of judgement—the information factor on the basis of 
objective tests, and the quantitative factor on the basis of production. 
The careful construction of objective rating scales yielded a quality 
criterion (3:3 to 6:6. judges, depending upon the function measured) 
of reliability over .90. The reliabilities of the different shop criteria 
approximated .80. The correlations of the individual tests with the 
quality-quantity criterion, however, were so low that the quantity 
criterion was abandoned. The validity correlation between battery and 
quality criterion alone was .65. The validity of the battery in respect 
of each individual type of shop work was found to be about as good as 
that of most standard intelligence tests in respect of success in 
individual academic subjects. 

Analysis of results in accordance with the principles of Spearman 
suggests that specific factors rather than a single general factor 
characterize mechanical ability. There is, however, some evidence of 
the presence of group factors. Four factors, on the other hand, were 
found to be unique: namely intelligence, height, agility, and mechani- 
cal ability. 

The authors thus conclude in the main (1) that mechanical ability 
as here defined is a unique trait, (2) consisting of a number of specific 
traits having possibly group factors in common. 

The reviewer would have felt entirely satisfied had the three 
following questions been answered: (1) During the actual testing 
process what control was there over the possibilities of leakage of 
information concerning the tests? (2) In so far as the criterion fails 
to measure “originality” does it not fall short of what, with 
occupational reference, might be termed mechanical ability? (3) To 
what extent are these findings in agreement with those of a recent 
investigation into “mechanical aptitude” by John W. Cox, one of 
Spearman’s pupils? 

Comment on the valuable investigation here reviewed cannot be 
allowed to pass without reference to the excellence of the binding, 
printing, and general layout of the volume. O. L. Harvey. 

University of Texas. 





Problems of Science Teaching at the College Level, by Archer Willis Hurd. 
Minneapolis, Minnesota: University of Minnesota Press. 


In these days of educational innovations and unproved panaceas 
for college ailments a bit of cautious, scientific investigation in the field 
of higher learning is like fresh water to the thirsty. For some years 
college men have been protagonists for educational experimentation 
and research in the elementary and secondary schools. Student 
achievement and matters relating thereto have been measured and probed. 
Some of the most loudly advocated new methods of teaching have been 
partially examined. However, perhaps because the beam seemed 
bigger in the other fellow’s eye, there has been comparatively little 
scientific investigation of teaching problems in college. It seems quite 
fitting that one should arise from the examined group and now lead 
some study of a similar nature in the realm of those who first made up 
the game. 

The work reported concerns itself with the following problems: 
(1) What differences in individual achievement in the study of anatomy 
are produced between those who work in groups of two and those who 
work in groups of four on a cadaver? (2) What is the effect of limiting 
the time given to laboratory work in human physiology or of partially 
replacing the laboratory work with library work? (3) What is the 
effect of eliminating laboratory work in the study of “Mechanics”? 
(4) What effect has class size on individual achievement in the physics 
courses of “Heat” and “Electricity and Magnetism”? (5) What 
influence does a high school course in physics have on the achievement 
of students in college physics? 

The details of the conclusions drawn from the studies are, naturally, 
of special interest to college teachers of anatomy, physiology, and 
physics. The main finding likely to be of general interest is that 
class size in the courses investigated appears to have no influence on 
student achievement. In the words of the author, “Achievement 
seems to be more a matter of individual incentive, capacity, and effort.” 

Dr. Hurd has nicely evaluated his work when he says, “The 
studies . . . find their greatest value in actual, concrete illustrations 
of techniques in educational experimentation. They represent 
attempts to settle problems of teaching by methods of experiment used 
in science.” The discussion of the outstanding references in the 
excellent bibliography and the section of the conclusions on suggested 
techniques for this type of experimentation strike the reviewer as 
being particularly noteworthy. LEONARD B. WHEAT. 

Teachers College, Columbia University. 